walkup page tables on ARM - linux-kernel

I'm trying to emulate the function lookup_address(http://lxr.free-electrons.com/source/arch/x86/mm/pageattr.c#L373) for arm platform (and just for kernel pagetables).
The point is I'm getting the address of swapper_pg_dir from TTBR1, and so far that's working.
I checked it with gdb:
(gdb) file vmlinux
Reading symbols from vmlinux...done.
(gdb) p init_mm.pgd
$1 = (pgd_t *) 0xc0004000
(gdb)
and the code from my module:
static pgd_t *get_global_pgd (void)
{
pgd_t *pgd;
unsigned int ttb_reg;
asm volatile (
" mrc p15, 0, %0, c2, c0, 1"
: "=r" (ttb_reg));
ttb_reg &= TTBR_MASK;
pgd = __va (ttb_reg);
pr_info ("get_global_pgd: %p\n", pgd);
return pgd;
}
and the output:
bananapi kernel: [ 5665.358139] mod: get_global_pgd: c0004000
So far, this is matching.
Now I'm computing the addr of the right pgd, doing:
pgd = get_global_pgd() + pgd_index (addr);
And since (addr >> 21) is 0x600, I get 0xc0007000.
Then I continue with:
pud = pud_offset (pgd, addr);
pr_info ("pud: 0x%0x - %p\n",pud_val (*pud), pud);
pmd = pmd_offset (pud, addr);
pr_info ("pmd: 0x%0x - %p\n", pmd_val (*pmd), pmd);
if (pmd == NULL || pmd_none (*pmd)) {
return NULL;
}
return pte_offset_kernel (pmd, addr);
output:
bananapi kernel: [ 5665.390391] mod: pud: 0x4001140e - c0007000
bananapi kernel: [ 5665.401603] mod: pmd: 0x4001140e - c0007000
bananapi kernel: [ 5665.423838] mod: pte: 0xe59f119c - c0011020
The problem is that the pte I get seems not to be fine, because the attributes of the pte don't match.
Let's take an address from /proc/kallsyms:
c0008054 t __create_page_tables
I can read it with gdb:
(gdb) x/2x 0xc0008054
0xc0008054 <__create_page_tables>: 0xe2884901 0xe1a00004
(gdb)
But the pte I get from this address, doesn't have the present flag:
I'm checking it with (this pte is the one I got from my lookup_address):
ret = pte_present (*pte);
pr_info ("pte_present: %d\n", ret);
The pte_present is 0 (which checks L_PTE_PRESENT flag defined include/asm/pgtable-2level.h), but shoudln't be 0 as long as I can read from it in GDB.
I've tested with some other addresses, for instance: 0xc0035618:
c0035618 T __put_task_struct
And for this one the L_PTE_PRESENT big is set.
I'm pretty sure I'm missing something, or I got it wrong.
Thanks in advance!

I've read all the pointed links, but I'm afraig I still don't get the whole picture.
I'll try to explain what I understood so far:
From include/asm/pgtable-2level.h , it looks like a page stores:
0 - pte1_linux
1024 - pte2_linux
2048 - pte1_hw
3072 - pte2_hw
Actually I also saw this in early_pte_alloc function, which allocates 4096bytes for the pte:
static pte_t * __init early_pte_alloc(pmd_t *pmd, unsigned long addr, unsigned long prot)
{
if (pmd_none(*pmd)) {
pte_t *pte = early_alloc(PTE_HWTABLE_OFF + PTE_HWTABLE_SIZE);
__pmd_populate(pmd, __pa(pte), prot);
}
BUG_ON(pmd_bad(*pmd));
return pte_offset_kernel(pmd, addr);
}
Then, in __pmd_populate we take the phys address of the memory previously allocated, we add 2048 (for hw pte), and we OR it with the protection flag (which in case to be from a kernel_domain should be PMD_TYPE_TABLE.
static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t pte,
pmdval_t prot)
{
pmdval_t pmdval = (pte + PTE_HWTABLE_OFF) | prot;
pmdp[0] = __pmd(pmdval);
#ifndef CONFIG_ARM_LPAE
pmdp[1] = __pmd(pmdval + 256 * sizeof(pte_t));
#endif
flush_pmd_entry(pmdp);
}
So far is clear.
Given this information, a walking page should be something like:
pmd_offset_k (addr)
pud_offset (pgd, addr)
pmd_offset (pud, addr)
pte_offset_kernel (pmd, addr)
pte_offset_kernel gives the virtual address of the pmd_val stored in pmd. (it also ANDs the value with PHYS_MASK and PAGE_MASK), and adds the pte_index (addr).
At this point I should have the value of the virtual addres of the linux_pte_0 (because the previous ANDs with PAGE_MASK brought me to the top of the page).
So I think at this point I should be able to check the L_PTE_* flags.
Am I wrong?
Thanks in advance

Related

How to reserve physical memory in kernel (arm64)

I want to reserve some memory to save kernel information. I copied reserve_crashkernel function to arm64 and modified it:
/* 16M alignment for crash kernel regions */
#define CRASH_ALIGN (16 << 20)
/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
.name = "Crash kernel",
.start = 0,
.end = 0,
.flags = IORESOURCE_MEM
};
static void __init reserve_crashkernel(void)
{
unsigned long long crash_size, crash_base, total_mem;
int ret;
crash_size = CRASH_ALIGN;
total_mem = memblock_phys_mem_size();
pr_info("crashkernel find memory %x - %llx.\n", CRASH_ALIGN, memblock_end_of_DRAM());
crash_base = memblock_find_in_range(CRASH_ALIGN, memblock_end_of_DRAM(),
crash_size, CRASH_ALIGN);
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable area found.\n");
return;
}
ret = memblock_reserve(crash_base, crash_size);
if (ret) {
pr_err("%s: Error reserving crashkernel memblock.\n", __func__);
return;
}
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
(unsigned long)(total_mem >> 20));
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
insert_resource(&iomem_resource, &crashk_res);
}
When the kernel started, I can find kernel print like this:
[ 0.000000] crashkernel find memory 1000000 - 210000000.
[ 0.000000] Reserving 16MB of memory at 8272MB for crashkernel (System RAM: 8190MB)
But the /proc/iomem doesn't seem right. Without my code there is a 'System RAM' region:
100000000-20fffffff : System RAM
Now with reserve_crashkernel, the region changed to:
205000000-205ffffff : Crash kernel
I don't why the 'System RAM' region disappeared and I'm not sure that my code is correct.

on PowerPC P2040/E500mc the LD instruction EA causing kernel panic during PCI card pull out

Everything I have read so far points to the fact that when accessing PCI address space during card pull out will cause kernel panic if not handled in the kernel machine_check_handler. The machine_check_handler for e500mc looks for the EA(Effective Address) of the instruction in the MCSRR0 register and compares it agains PCI address space. However, since this address (EA) was not in PCI address space, caused the kernel panic eventually, as it could not be handled in the machine check interrupt handler as the address was some bad address that was stored by CPU in the MCSRR0.
Although the GPRs are all pointing to PCI address space BAR addresses from previous cpu instructions, but the Effective Address stored in the MCSRR0 register is the same invalid physical address that the NIP is pointing to...
The MCSRR1 points to machine state (MSR) at the point of interrupt and shows LD|GLD bits set along with MCSRR1[RI] bit. so its a recoverable synchronous interrupt.
And since the CPU address access was on an external hot-plugged device we need not crash the system even if the device is not present and hence the kernel check and safe return from interrupt.
I have a few questions regarding this issue:
Which GPRs are used to determine the effective address of the LD instruction. The LD bit is set in MCSR register? How do I tell which addressing mode was used for generating the effective address for the LD instruction?
the LD instruction uses rD,rA,rB operands, how do i find which EA calculation mode is being used by the processor. Apparently there are 4 of them. Also, which GPR's do each of these operands point to? I couldn't figure it out from the E500MCRM or powepc EREF.
Since we are writing to PCI address space from user space, the PCI device registers are mapped to some virtual address space in the process memory to which we are writing. This is non cached mapping as far as i know.
Does the CPU address translation to PCI device physical address, for accessing the PCI device result in bad address as the PCI device is no longer connected. My assumption for this was, since the device is no longer present the effective address returned was some junk value that caused this kernel panic. I am not sure if that's how CPU works.
Any suggestions helping my understanding are welcome. this is way deep down and beyond my expertise. I have gone through the E500MCRM, P2040RM and powerpc EREF but I cannot figure out why I am getting a bad address instead of a PCI physical address in the Effective address.
kernel - crash dump
fujitsu:~$ fsl_pci_mcheck_exception-> SPRN_MCAR: 0x0
fsl_pci_mcheck_exception-> SPRN_MCSRR0: 0x0f6fec68
fsl_pci_mcheck_exception-> SPRN_MCSRR1: 0x2d002
fsl_pci_mcheck_exception-> SPRN_MCAR: 0x0
fsl_pci_mcheck_exception-> SPRN_DEAR: 0x0
fsl_pci_mcheck_exception-> current->pid: [8333]
fsl_pci_mcheck_exception-> after __get_user_inatomic(inst, &regs->nip): 0x0f6fec68(inst), 0x0f6fec68(regs->nip), 0x0(ret)
Machine check in kernel mode.
Caused by (from MCSR=a000): Load Error Report
Guarded Load Error Report
Oops: Machine check, sig: 7 [#1]
PREEMPT SMP NR_CPUS=4 P2041 RDB
Modules linked in: i2cBridge(O) interruptDriver_pb(O) cma_alloc(O) hwtp_drv(O) interruptDriver_wdt(O)
NIP: 0f6fec68 LR: 0f6fec4c CTR: 0f6faad4
REGS: e4ec5f10 TRAP: 0204 Tainted: G O (3.8.13-rt9+)
MSR: 0002d002 <CE,EE,PR,ME> CR: 40044442 XER: 20000000
TASK = e57dc020[8333] 'RxManager' THREAD: e4ec4000 CPU: 3
GPR00: 0f6fec4c 52afea90 52b06910 50400000 52afeb50 00000003 a0105210 52afebfc
GPR08: a1ffffff a0000000 0000000c a0000000 20044448 1032e800 52900000 00000006
GPR16: 0f74f434 0f729d20 135a78a0 00200000 0fe28280 52aff4b0 00000000 0fe2a6c8
GPR24: 52afec98 0f6cd268 135a7630 00105210 52afebfc 50400000 0f71d31c 00000003
NIP [0f6fec68] 0xf6fec68
LR [0f6fec4c] 0xf6fec4c
Call Trace:
---[ end trace 2715d0da39427f69 ]---
here's the code from fsl_pci.c that's getting called from machine_check_handler
#ifdef CONFIG_E500
static int mcheck_handle_load(struct pt_regs *regs, u32 inst)
{
unsigned int rd, ra, rb, d;
rd = get_rt(inst);
ra = get_ra(inst);
rb = get_rb(inst);
d = get_d(inst);
printk(KERN_EMERG "%s==> rd==0x%x, ra=0x%x, rb=0x%x, d=0x%x\n", __FUNCTION__, rd, ra, rb, d);
printk(KERN_EMERG "%s==> get_op(inst) = 0x%x\n", __FUNCTION__, get_op(inst));
return 1;
switch (get_op(inst)) {
case 31:
switch (get_xop(inst)) {
case OP_31_XOP_LWZX:
case OP_31_XOP_LWBRX:
regs->gpr[rd] = 0xffffffff;
break;
case OP_31_XOP_LWZUX:
regs->gpr[rd] = 0xffffffff;
regs->gpr[ra] += regs->gpr[rb];
break;
case OP_31_XOP_LBZX:
regs->gpr[rd] = 0xff;
break;
case OP_31_XOP_LBZUX:
regs->gpr[rd] = 0xff;
regs->gpr[ra] += regs->gpr[rb];
break;
case OP_31_XOP_LHZX:
case OP_31_XOP_LHBRX:
regs->gpr[rd] = 0xffff;
break;
case OP_31_XOP_LHZUX:
regs->gpr[rd] = 0xffff;
regs->gpr[ra] += regs->gpr[rb];
break;
default:
return 0;
}
break;
case OP_LWZ:
regs->gpr[rd] = 0xffffffff;
break;
case OP_LWZU:
regs->gpr[rd] = 0xffffffff;
regs->gpr[ra] += (s16)d;
break;
case OP_LBZ:
regs->gpr[rd] = 0xff;
break;
case OP_LBZU:
regs->gpr[rd] = 0xff;
regs->gpr[ra] += (s16)d;
break;
case OP_LHZ:
regs->gpr[rd] = 0xffff;
break;
case OP_LHZU:
regs->gpr[rd] = 0xffff;
regs->gpr[ra] += (s16)d;
break;
default:
return 0;
}
return 1;
}
static int is_in_pci_mem_space(phys_addr_t addr)
{
struct pci_controller *hose;
struct resource *res;
int i;
list_for_each_entry(hose, &hose_list, list_node) {
if (!(hose->indirect_type & PPC_INDIRECT_TYPE_EXT_REG))
continue;
for (i = 0; i < 3; i++) {
res = &hose->mem_resources[i];
if ((res->flags & IORESOURCE_MEM) &&
addr >= res->start && addr <= res->end)
printk(KERN_EMERG "%s ==> returning from checking addresses\n", __FUNCTION__);
return 1;
}
}
printk(KERN_EMERG "%s ==> returning without checking addresses\n", __FUNCTION__);
return 1;
}
int fsl_pci_mcheck_exception(struct pt_regs *regs)
{
u32 inst;
int ret;
phys_addr_t addr = 0;
/* Let KVM/QEMU deal with the exception */
if (regs->msr & MSR_GS)
return 0;
#ifdef CONFIG_PHYS_64BIT
addr = mfspr(SPRN_MCARU);
addr <<= 32;
#endif
addr += mfspr(SPRN_MCSRR0);
printk(KERN_EMERG "%s-> SPRN_MCAR: 0x%x\n", __FUNCTION__, addr);
printk(KERN_EMERG "%s-> SPRN_MCSRR0: 0x%x\n", __FUNCTION__, mfspr(SPRN_MCSRR0));
printk(KERN_EMERG "%s-> SPRN_MCSRR1: 0x%x\n", __FUNCTION__, mfspr(SPRN_MCSRR1));
printk(KERN_EMERG "%s-> current->pid: 0x%x\n", __FUNCTION__, current->pid);
#ifdef CONFIG_E500
if (mfspr(SPRN_EPCR) & SPRN_EPCR_ICM)
addr = PFN_PHYS(vmalloc_to_pfn((void *)mfspr(SPRN_DEAR)));
printk(KERN_EMERG "%s-> SPRN_DEAR: 0x%x\n", __FUNCTION__, addr);
#endif
printk(KERN_EMERG "%s-> before get_user: 0x%x, 0x%x\n", __FUNCTION__, regs->nip, inst);
if (is_in_pci_mem_space(addr)) {
if (user_mode(regs)) {
pagefault_disable();
/* I am using __get_user_inatomic to get the instruction from the user
space as any other get_user versions were resulting in -EFAULT as they can
sleep and this needs to be called from user context and we are in interrupt
context.
*/
ret = __get_user_inatomic(inst, &regs->nip);
pagefault_enable();
} else {
ret = probe_kernel_address(regs->nip, inst);
}
printk(KERN_EMERG "%s-> after get_user: 0x%x, 0x%x, 0x%d\n", __FUNCTION__, regs->nip, inst, ret);
if (mcheck_handle_load(regs, inst)) {
regs->nip += 4;
printk(KERN_EMERG "%s-> after mcheck_handle load: 0x%x, 0x%x\n", __FUNCTION__, regs->nip, inst);
return 1;
}
}
return 0;
}
#endif
Here's the code I added to fix the kernel panic. Looks like regs->gpr[0] is destination address of the LD instruction and incrementing the instruction pointer took care of the return from the interrupt context cleanly. I still have the issue of verifying that this interrupt originated due to access of PCI device address. Right now I have commented out the PCI address range check and without this check I am able to access any address without crashing the system which is even worse.
Yes. Even a null pointer access doesn't crash the system anymore. I tried it with devmem2 and accessed a nullpointer and the call goes through the interrupt and returns safely after dumping the logs from the interrupt handler.
regs->gpr[0] = 0xffffffff;
regs->nip += 4;
return 1;
if (mcheck_handle_load(regs, inst)) {

How to use mmap for devmemon MT7620n

I`m trying to access the GPIOs of a MT7620n via register settings. So far I can access them by using /sys/class/gpio/... but that is not fast enough for me.
In the Programming guide of the MT7620 page 84 you can see that the GPIO base address is at 0x10000600 and the single registers have an offset of 4 Bytes.
MT7620 Programming Guide
Something like:
devmem 0x10000600
from the shell works absolutely fine but I cannot access it from inside of a c Programm.
Here is my code:
#define GPIOCHIP_0_ADDDRESS 0x10000600 // base address
#define GPIO_BLOCK 4
volatile unsigned long *gpiochip_0_Address;
int gpioSetup()
{
int m_mfd;
if ((m_mfd = open("/dev/mem", O_RDWR)) < 0)
{
printf("ERROR open\n");
return -1;
}
gpiochip_0_Address = (unsigned long*)mmap(NULL, GPIO_BLOCK, PROT_READ|PROT_WRITE, MAP_SHARED, m_mfd, GPIOCHIP_0_ADDDRESS);
close(m_mfd);
if(gpiochip_0_Address == MAP_FAILED)
{
printf("mmap() failed at phsical address:%d %s\n", GPIOCHIP_0_ADDDRESS, strerror(errno));
return -2;
}
return 0;
}
The Output I get is:
mmap() failed at phsical address:268436992 Invalid argument
What do I have to take care of? Do I have to make the memory accessable before? I´m running as root.
Thanks
EDIT
Peter Cordes is right, thank you so much.
Here is my final solution, if somebody finds a bug, please tell me ;)
#define GPIOCHIP_0_ADDDRESS 0x10000600 // base address
volatile unsigned long *gpiochip_0_Address;
int gpioSetup()
{
const size_t pagesize = sysconf(_SC_PAGE_SIZE);
unsigned long gpiochip_pageAddress = GPIOCHIP_0_ADDDRESS & ~(pagesize-1); //get the closest page-sized-address
const unsigned long gpiochip_0_offset = GPIOCHIP_0_ADDDRESS - gpiochip_pageAddress; //calculate the offset between the physical address and the page-sized-address
int m_mfd;
if ((m_mfd = open("/dev/mem", O_RDWR)) < 0)
{
printf("ERROR open\n");
return -1;
}
page_virtual_start_Address = (unsigned long*)mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED, m_mfd, gpiochip_pageAddress);
close(m_mfd);
if(page_virtual_start_Address == MAP_FAILED)
{
printf("ERROR mmap\n");
printf("mmap() failed at phsical address:%d %d\n", GPIOCHIP_0_ADDDRESS, strerror(errno));
return -2;
}
gpiochip_0_Address = page_virtual_start_Address + (gpiochip_0_offset/sizeof(long));
return 0;
}
mmap's file offset argument has to be page-aligned, and that's one of the documented reasons for mmap to fail with EINVAL.
0x10000600 is not a multiple of 4k, or even 1k, so that's almost certainly your problem. I don't think any systems have pages as small as 512B.
mmap a whole page that includes the address you want, and access the MMIO registers at an offset within that page.
Either hard-code it, or maybe do something like GPIOCHIP_0_ADDDRESS & ~(page_size-1) to round down an address to a page-aligned boundary. You should be able to do something that gets the page size as a compile-time constant so it still compiles efficiently.

Getting TSC rate from x86 kernel

I have an embedded Linux system running on an Atom, which is a new enough CPU to have an invariant TSC (time stamp counter), whose frequency the kernel measures on startup. I use the TSC in my own code to keep time (avoiding kernel calls), and my startup code measures the TSC rate, but I'd rather just use the kernel's measurement. Is there any way to retrieve this from the kernel? It's not in /proc/cpuinfo anywhere.
BPFtrace
As root, you can retrieve the kernel's TSC rate with bpftrace:
# bpftrace -e 'BEGIN { printf("%u\n", *kaddr("tsc_khz")); exit(); }' | tail -n
(tested it on CentOS 7 and Fedora 29)
That is the value that is defined, exported and maintained/calibrated in arch/x86/kernel/tsc.c.
GDB
Alternatively, also as root, you can also read it from /proc/kcore, e.g.:
# gdb /dev/null /proc/kcore -ex 'x/uw 0x'$(grep '\<tsc_khz\>' /proc/kallsyms \
| cut -d' ' -f1) -batch 2>/dev/null | tail -n 1 | cut -f2
(tested it on CentOS 7 and Fedora 29)
SystemTap
If the system doesn't have bpftrace nor gdb available but SystemTap you can get it like this (as root):
# cat tsc_khz.stp
#!/usr/bin/stap -g
function get_tsc_khz() %{ /* pure */
THIS->__retvalue = tsc_khz;
%}
probe oneshot {
printf("%u\n", get_tsc_khz());
}
# ./tsc_khz.stp
Of course, you can also write a small kernel module that provides access to tsc_khz via the /sys pseudo file system. Even better, somebody already did that and a tsc_freq_khz module is available on GitHub. With that the following should work:
# modprobe tsc_freq_khz
$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz
(tested on Fedora 29, reading the sysfs file doesn't require root)
Kernel Messages
In case nothing of the above is an option you can parse the TSC rate from the kernel logs. But this gets ugly fast because you see different kinds of messages on different hardware and kernels, e.g. on a Fedora 29 i7 system:
$ journalctl --boot | grep 'kernel: tsc:' -i | cut -d' ' -f5-
kernel: tsc: Detected 2800.000 MHz processor
kernel: tsc: Detected 2808.000 MHz TSC
But on a Fedora 29 Intel Atom just:
kernel: tsc: Detected 2200.000 MHz processor
While on a CentOS 7 i5 system:
kernel: tsc: Fast TSC calibration using PIT
kernel: tsc: Detected 1895.542 MHz processor
kernel: tsc: Refined TSC clocksource calibration: 1895.614 MHz
Perf Values
The Linux Kernel doesn't provide an API to read the TSC rate, yet. But it does provide one for getting the mult and shift values that can be used to convert TSC counts to nanoseconds. Those values are derived from tsc_khz - also in arch/x86/kernel/tsc.c - where tsc_khz is initialized and calibrated. And they are shared with userspace.
Example program that uses the perf API and accesses the shared page:
#include <asm/unistd.h>
#include <inttypes.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}
The actual code:
int main(int argc, char **argv)
{
struct perf_event_attr pe = {
.type = PERF_TYPE_HARDWARE,
.size = sizeof(struct perf_event_attr),
.config = PERF_COUNT_HW_INSTRUCTIONS,
.disabled = 1,
.exclude_kernel = 1,
.exclude_hv = 1
};
int fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
perror("perf_event_open failed");
return 1;
}
void *addr = mmap(NULL, 4*1024, PROT_READ, MAP_SHARED, fd, 0);
if (!addr) {
perror("mmap failed");
return 1;
}
struct perf_event_mmap_page *pc = addr;
if (pc->cap_user_time != 1) {
fprintf(stderr, "Perf system doesn't support user time\n");
return 1;
}
printf("%16s %5s\n", "mult", "shift");
printf("%16" PRIu32 " %5" PRIu16 "\n", pc->time_mult, pc->time_shift);
close(fd);
}
Tested in on Fedora 29 and it works also for non-root users.
Those values can be used to convert a TSC count to nanoseconds with a function like this one:
static uint64_t mul_u64_u32_shr(uint64_t cyc, uint32_t mult, uint32_t shift)
{
__uint128_t x = cyc;
x *= mult;
x >>= shift;
return x;
}
CPUID/MSR
Another way to obtain the TSC rate is to follow DPDK's lead.
DPDK on x86_64 basically uses the following strategy:
Read the 'Time Stamp Counter and Nominal Core Crystal Clock Information Leaf' via cpuid intrinsics (doesn't require special privileges), if available
Read it from the MSR (requires the rawio capability and read permissions on /dev/cpu/*/msr), if possible
Calibrate it in userspace by other means, otherwise
FWIW, a quick test shows that the cpuid leaf doesn't seem to be that widely available, e.g. an i7 Skylake and a goldmont atom don't have it. Otherwise, as can be seen from the DPDK code, using the MSR requires a bunch of intricate case distinctions.
However, in case the program already uses DPDK, getting the TSC rate, getting TSC values or converting TSC values is just a matter of using the right DPDK API.
I had a brief look and there doesn't seem to be a built-in way to directly get this information from the kernel.
However, the symbol tsc_khz (which I'm guessing is what you want) is exported by the kernel. You could write a small kernel module that exposes a sysfs interface and use that to read out the value of tsc_khz from userspace.
If writing a kernel module is not an option, it may be possible to use some Dark Magic™ to read out the value directly from the kernel memory space. Parse the kernel binary or System.map file to find the location of the tsc_khz symbol and read it from /dev/{k}mem. This is, of course, only possible provided that the kernel is configured with the appropriate options.
Lastly, from reading the kernel source comments, it looks like there's a possibility that the TSC may be unstable on some platforms. I don't know much about the inner workings of the x86 arch but this may be something you want to take into consideration.
The TSC rate is directly related to "cpu MHz" in /proc/cpuinfo. Actually, the better number to use is "bogomips". The reason is that while the freq for TSC is the max CPU freq, the current "cpu Mhz" can vary at time of your invocation.
The bogomips value is computed at boot. You'll need to adjust this value by number of cores and processor count (i.e. the number of hyperthreads) That gives you [fractional] MHz. That is what I use to do what you want to do.
To get the processor count, look for the last "processor: " line. The processor count is <value> + 1. Call it "cpu_count".
To get number of cores, any "cpu cores: " works. number of cores is <value>. Call it "core_count".
So, the formula is:
smt_count = cpu_count;
if (core_count)
smt_count /= core_count;
cpu_freq_in_khz = (bogomips * scale_factor) / smt_count;
That is extracted from my actual code, which is below.
Here's the actual code I use. You won't be able to use it directly because it relies on boilerplate I have, but it should give you some ideas, particularly with how to compute
// syslgx/tvtsc -- system time routines (RDTSC)
#include <tgb.h>
#include <zprt.h>
tgb_t systvinit_tgb[] = {
{ .tgb_val = 1, .tgb_tag = "cpu_mhz" },
{ .tgb_val = 2, .tgb_tag = "bogomips" },
{ .tgb_val = 3, .tgb_tag = "processor" },
{ .tgb_val = 4, .tgb_tag = "cpu_cores" },
{ .tgb_val = 5, .tgb_tag = "clflush_size" },
{ .tgb_val = 6, .tgb_tag = "cache_alignment" },
TGBEOT
};
// _systvinit -- get CPU speed
static void
_systvinit(void)
{
const char *file;
const char *dlm;
XFIL *xfsrc;
int matchflg;
char *cp;
char *cur;
char *rhs;
char lhs[1000];
tgb_pc tgb;
syskhz_t khzcpu;
syskhz_t khzbogo;
syskhz_t khzcur;
sysmpi_p mpi;
file = "/proc/cpuinfo";
xfsrc = fopen(file,"r");
if (xfsrc == NULL)
sysfault("systvinit: unable to open '%s' -- %s\n",file,xstrerror());
dlm = " \t";
khzcpu = 0;
khzbogo = 0;
mpi = &SYS->sys_cpucnt;
SYSZAPME(mpi);
// (1) look for "cpu MHz : 3192.515" (preferred)
// (2) look for "bogomips : 3192.51" (alternate)
// FIXME/CAE -- on machines with speed-step, bogomips may be preferred (or
// disable it)
while (1) {
cp = fgets(lhs,sizeof(lhs),xfsrc);
if (cp == NULL)
break;
// strip newline
cp = strchr(lhs,'\n');
if (cp != NULL)
*cp = 0;
// look for symbol value divider
cp = strchr(lhs,':');
if (cp == NULL)
continue;
// split symbol and value
*cp = 0;
rhs = cp + 1;
// strip trailing whitespace from symbol
for (cp -= 1; cp >= lhs; --cp) {
if (! XCTWHITE(*cp))
break;
*cp = 0;
}
// convert "foo bar" into "foo_bar"
for (cp = lhs; *cp != 0; ++cp) {
if (XCTWHITE(*cp))
*cp = '_';
}
// match on interesting data
matchflg = 0;
for (tgb = systvinit_tgb; TGBMORE(tgb); ++tgb) {
if (strcasecmp(lhs,tgb->tgb_tag) == 0) {
matchflg = tgb->tgb_val;
break;
}
}
if (! matchflg)
continue;
// look for the value
cp = strtok_r(rhs,dlm,&cur);
if (cp == NULL)
continue;
zprt(ZPXHOWSETUP,"_systvinit: GRAB/%d lhs='%s' cp='%s'\n",
matchflg,lhs,cp);
// process the value
// NOTE: because of Intel's speed step, take the highest cpu speed
switch (matchflg) {
case 1: // genuine CPU speed
khzcur = _systvinitkhz(cp);
if (khzcur > khzcpu)
khzcpu = khzcur;
break;
case 2: // the consolation prize
khzcur = _systvinitkhz(cp);
// we've seen some "wild" values
if (khzcur > 10000000)
break;
if (khzcur > khzbogo)
khzbogo = khzcur;
break;
case 3: // remember # of cpu's so we can adjust bogomips
mpi->mpi_cpucnt = atoi(cp);
mpi->mpi_cpucnt += 1;
break;
case 4: // remember # of cpu cores so we can adjust bogomips
mpi->mpi_corecnt = atoi(cp);
break;
case 5: // cache flush size
mpi->mpi_cshflush = atoi(cp);
break;
case 6: // cache alignment
mpi->mpi_cshalign = atoi(cp);
break;
}
}
fclose(xfsrc);
// we want to know the number of hyperthreads
mpi->mpi_smtcnt = mpi->mpi_cpucnt;
if (mpi->mpi_corecnt)
mpi->mpi_smtcnt /= mpi->mpi_corecnt;
zprt(ZPXHOWSETUP,"_systvinit: FINAL khzcpu=%d khzbogo=%d mpi_cpucnt=%d mpi_corecnt=%d mpi_smtcnt=%d mpi_cshalign=%d mpi_cshflush=%d\n",
khzcpu,khzbogo,mpi->mpi_cpucnt,mpi->mpi_corecnt,mpi->mpi_smtcnt,
mpi->mpi_cshalign,mpi->mpi_cshflush);
if ((mpi->mpi_cshalign == 0) || (mpi->mpi_cshflush == 0))
sysfault("_systvinit: cache parameter fault\n");
do {
// use the best reference
// FIXME/CAE -- with speed step, bogomips is better
#if 0
if (khzcpu != 0)
break;
#endif
khzcpu = khzbogo;
if (mpi->mpi_smtcnt)
khzcpu /= mpi->mpi_smtcnt;
if (khzcpu != 0)
break;
sysfault("_systvinit: unable to obtain cpu speed\n");
} while (0);
systvkhz(khzcpu);
zprt(ZPXHOWSETUP,"_systvinit: EXIT\n");
}
// _systvinitkhz -- decode value
// RETURNS: CPU freq in khz
static syskhz_t
_systvinitkhz(char *str)
{
char *src;
char *dst;
int rhscnt;
char bf[100];
syskhz_t khz;
zprt(ZPXHOWSETUP,"_systvinitkhz: ENTER str='%s'\n",str);
dst = bf;
src = str;
// get lhs of lhs.rhs
for (; *src != 0; ++src, ++dst) {
if (*src == '.')
break;
*dst = *src;
}
// skip over the dot
++src;
// get rhs of lhs.rhs and determine how many rhs digits we have
rhscnt = 0;
for (; *src != 0; ++src, ++dst, ++rhscnt)
*dst = *src;
*dst = 0;
khz = atol(bf);
zprt(ZPXHOWSETUP,"_systvinitkhz: PRESCALE bf='%s' khz=%d rhscnt=%d\n",
bf,khz,rhscnt);
// scale down (e.g. we got xxxx.yyyy)
for (; rhscnt > 3; --rhscnt)
khz /= 10;
// scale up (e.g. we got xxxx.yy--bogomips does this)
for (; rhscnt < 3; ++rhscnt)
khz *= 10;
zprt(ZPXHOWSETUP,"_systvinitkhz: EXIT khz=%d\n",khz);
return khz;
}
UPDATE:
Sigh. Yes.
I was using "cpu MHz" from /proc/cpuinfo prior to the introduction of processors with "speed step" technology, so I switched to "bogomips" and the algorithm was derived empirically based on that. When I derived it, I only had access to hyperthreaded machines. However, I've found an old one that is not and the SMT thing isn't valid.
However, it appears that bogomips is always 2x the [maximum] CPU speed. See http://www.clifton.nl/bogo-faq.html That hasn't always been my experience on all kernel versions over the years [IIRC, I started with 0.99.x], but it's probably a reliable assumption these days.
With "constant TSC" [which all newer processors have], denoted by constant_tsc in the flags: field in /proc/cpuinfo, the TSC rate is the maximum CPU frequency.
Originally, the only way to get the frequency information was from /proc/cpuinfo. Now, however, in more modern kernels, there is another way that may be easier and more definitive [I had code coverage for this in other software of mine, but had forgotten about it]:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
The contents of this file is the maximum CPU frequency in kHz. There are analogous files for the other CPU cores. The files should be identical for most sane motherboards (e.g. ones that are composed of the same model chip and don't try to mix [say] i7s and atoms). Otherwise, you'd have to keep track of the info on a per-core basis and that would get messy fast.
The given directory also has other interesting files. For example, if your processor has "speed step" [and some of the other files can tell you that], you can force maximum performance by writing performance to the scaling_governor file. This will disable use of speed step.
If the processor did not have constant_tsc, you'd have to disable speed step [and run the cores at maximum rate] to get accurate measurements

linux driver, port 2.6.19.2 - 2.6.38-rc2 ARM11 iMX31, amba MBX device LogicPD Litekit GLES driver

Code followed with question
#define MBX_REG_SYS_PHYS_BASE 0xC0000000
#define MBX_REG_RANGE 0x00004000
static struct resource mxc_reg_resources[] = {
{
.start = MBX_REG_SYS_PHYS_BASE,
.end = MBX_REG_SYS_PHYS_BASE + MBX_REG_RANGE - 1,
.flags = IORESOURCE_MEM }
};
mbx_reg = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!mbx_reg)
return -EINVAL;
reg_base = ioremap(mbx_reg->start, resource_size(mbx_reg));
if (!reg_base) {
ret = -ENOMEM;
goto eremap;
}
printk(KERN_CRIT "Address: from 0x%08X to 0x%08X\n",
mbx_reg->start, reg_base);
regread = mx3reg_read_reg(mx3reg, MBX1_GLOBREG_REVISION);
printk(KERN_CRIT "MBX1_GLOBREG_REVISION: 0x%.8X\n", regread);
This code works on iMX31 from LogicPD using 2.6.19.2 with out of tree patching from freescale.
when porting it to 2.6.38-rc2 it no longer works.
here are some data results:
Working results:
Address: 0xC7860000
MBX1_GLOBREG_REVISION: 0x01010200
Failed results:
Address: 0xC48A0000
MBX1_GLOBREG_REVISION: 0x00000000
Address: 0xC48A8000
MBX1_GLOBREG_REVISION: 0x00000000
Address: 0xC48B8000
MBX1_GLOBREG_REVISION: 0x00000000
Address: 0xC48C0000
MBX1_GLOBREG_REVISION: 0x00000000
maybe interesting is on 2.6.19.2 it always gets the same address mapped
yet in 2.6.38-rc2 it does not.
Are you sure your defines are still good ? The output for this line should not change :
printk(KERN_CRIT "Address: from 0x%08X to 0x%08X\n",
mbx_reg->start, reg_base);
Since it is a physical address. However it is not printed in your output.
Check the pripheral you are accessing is clocked.
In order to have this device ready to communicate you need to setup the peripheral port remap register
/* Setup Peripheral Port Remap register for AVIC */
asm("ldr r0, =0xC0000015 \n\
mcr p15, 0, r0, c15, c2, 4");
here is the code from the original 2.6.19.2 kernel, executed from a board fixup routine.
and of course the clocks would have to be enabled as well, and this driver example is not showing that either.

Resources