What does the linux scheduler return when there are no tasks left in the queue - linux-kernel

The linux scheduler calls for an scheduler algorithm that finds the next task in the task list.
What does the scheduling algorithm return if there are no tasks left?
Below is a piece of code
struct rt_info* sched_something(struct list_head *head, int flags)
{
/* some logic */
return some_task; // What task's value if there are no tasks left.
}

There is special "idle" task with PID=0, sometimes it is called swapper (Why do we need a swapper task in linux?). This task is scheduled to CPU core when there is no other task ready to run. There are several idle tasks - one for each CPU core
Source kernel/sched/core.c http://lxr.free-electrons.com/source/kernel/sched/core.c?v=3.16#L3160
3160 /**
3161 * idle_task - return the idle task for a given cpu.
3162 * #cpu: the processor in question.
3163 *
3164 * Return: The idle task for the cpu #cpu.
3165 */
3166 struct task_struct *idle_task(int cpu)
3167 {
3168 return cpu_rq(cpu)->idle;
3169 }
3170
So, pointer to this task is stored in runqueue (struct rq) kernel/sched/sched.h:
502 * This is the main, per-CPU runqueue data structure. ... */
508 struct rq {
557 struct task_struct *curr, *idle, *stop;
There is some init code in sched/core.c:
4517 /**
4518 * init_idle - set up an idle thread for a given CPU
4519 * #idle: task in question
4520 * #cpu: cpu the idle task belongs to
4524 */
4525 void init_idle(struct task_struct *idle, int cpu)
I think, idle task will run some kind of loop with special asm commands to inform CPU core that there is no useful job...
This post http://duartes.org/gustavo/blog/post/what-does-an-idle-cpu-do/ says idle task executes cpu_idle_loop kernel/sched/idle.c (there may be custom version of loop for arch and CPU - called with cpu_idle_poll(void) -> cpu_relax();):
40 #define cpu_relax() asm volatile("rep; nop")
45 static inline int cpu_idle_poll(void)
..
50 while (!tif_need_resched())
51 cpu_relax();
221 if (cpu_idle_force_poll || tick_check_broadcast_expired())
222 cpu_idle_poll();
223 else
224 cpuidle_idle_call();

Related

for_each_possible_cpu macro in vmalloc_init() function, does the code run in only one cpu? or in every cpu?

this is not about programming, but I ask it here..
in linux start_kernel() function, in the mm_init() function, I see vmalloc_init() function.
inside the function I see codes like this.
void __init vmalloc_init(void)
{
struct vmap_area *va;
struct vm_struct *tmp;
int i;
/*
* Create the cache for vmap_area objects.
*/
vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
for_each_possible_cpu(i) {
struct vmap_block_queue *vbq;
struct vfree_deferred *p;
vbq = &per_cpu(vmap_block_queue, i);
spin_lock_init(&vbq->lock);
INIT_LIST_HEAD(&vbq->free);
p = &per_cpu(vfree_deferred, i);
init_llist_head(&p->list);
INIT_WORK(&p->wq, free_work);
}
/* Import existing vmlist entries. */
for (tmp = vmlist; tmp; tmp = tmp->next) {
va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
if (WARN_ON_ONCE(!va))
continue;
va->va_start = (unsigned long)tmp->addr;
va->va_end = va->va_start + tmp->size;
va->vm = tmp;
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
}
/*
* Now we can initialize a free vmap space.
*/
vmap_init_free_space();
vmap_initialized = true;
}
I'm not sure if this code is run on every cpu(core) or just on the first cpu?
if this code runs on every smp core, how is this code inside for_each_possible_cpu loop run?
The smp setup seems to be done before this function.
start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.
Every macro in for_each_cpu family is just wrapper for for() loop, where iterator is a CPU index.
E.g., the core macro of this family is defined as
#define for_each_cpu(cpu, mask) \
for ((cpu) = -1; \
(cpu) = cpumask_next((cpu), (mask)), \
(cpu) < nr_cpu_ids;)
Each macro in for_each_cpu family uses its own CPUs mask, which is just a set of bits corresponded to CPU indices. E.g. mask for for_each_possible_cpu macro have bits set for every index of CPU which could ever be enabled in current machine session.

Trigger Countdown with 433 MHz transmission

I would like to use an arduino to read 433 MHz transmission from multiple Soil Moisture Sensors. Since I can never be sure all transmissions reach the receiver I'd like to set a countdown from the moment the first transmission is received. If another transmission is received, the countdown starts again.
After a defined amount of time (e.g. 10 Minutes) without any more signals or if all signals have been received (e.g 4 Sensors) the receiving unit should stop and come to decision based on the data it got to the point.
For transmitting and receiving I am using the <RCSwitch.h>library.
The loop of the receiving unit and one Sensor looks like this:
#include <RCSwitch.h>
RCSwitch mySwitch = RCSwitch();
void Setup(){
Serial.begin(9600);
mySwitch.enableReceive(4);
}
void loop() {
if (mySwitch.available()) {
int value = mySwitch.getReceivedValue();
if (value == 0) {
lcd.clear();
Serial.print("Unknown encoding");
}
else {
Serial.print(mySwitch.getReceivedValue());
Serial.print("%");
}
The full code includes some differentiation mechanism for all sensors but I figured that might not be relevant for my question.
Question:
What's the best way to do this without a real time clock module. As far as I know I can't wait by using delay(...)since then I won't receive any data while the processor waiting.
You can use millis() as a clock. It returns the number of milliseconds since the arduino started.
#define MINUTES(x) ((x) * 60000UL)
unsigned long countStart = 0;
void loop()
{
if (/*read from module ok*/)
{
countStart = millis();
// sanity check, since millis() eventually rolls over
if (countStart == 0)
countStart = 1;
}
if (countStart && ((millis() - countStart) > MINUTES(10)))
{
countStart = 0;
// trigger event
}
}
Arduino's internal timers can also be used in this situation. If a long time period is needed, it's better to use 16bit counter (usually timer1) at 1024 prescaler (largest available). If the largest time interval of timer is greater than time required, then a counter have to be added in order to keep track of 1 minute interval.
For example, for 1-minute interval, initialize registers as:
TCCR1A = 0; //Initially setting every register as 0x0000
TCCR1B = 0;
TCNT1 = 0;
OCR1A = 468750; // compare match register 16MHz/1024/2/frequency(hz)
TCCR1B |= (1 << WGM12); // Timer compare mode
TCCR1B |= (1 << CS10) | (1 << CS10); // 1024 prescaler
TIMSK1 |= (1 << OCIE1A); // enable timer compare interrupt
These configuration of timer will give interrupt time of 1 minute. And upon timer completion ISR TIMER1_COMPA_vect will be run. You can play around with value of OCR1A for different interrupt periods.
Main advantage of using interrupts is that they don't block any task and can will be executed instantaneously (if interrupts are not disabled explicitly).

What does a C process status mean in htop?

I am using htop on osx and I can't seem to find out what a 'C' status in the 'S' status column means for a process status.
What does a C process status mean in htop?
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D Uninterruptible sleep (usually IO)
R Running or runnable (on run queue)
S Interruptible sleep (waiting for an event to complete)
T Stopped, either by a job control signal or because it is being traced.
W Paging (not valid since the 2.6.xx kernel)
X Dead (should never be seen)
Z Defunct ("zombie") process, terminated but not reaped by its parent.
Source: man ps
I recently came across a second list:
R Running
S Sleeping in an interruptible wait
D Waiting in uninterruptible disk sleep
Z Zombie
T Stopped (on a signal) or (before Linux 2.6.33) trace stopped
t Tracing stop (Linux 2.6.33 onward)
W Paging (only before Linux 2.6.0)
X Dead (from Linux 2.6.0 onward)
x Dead (Linux 2.6.33 to 3.13 only)
K Wakekill (Linux 2.6.33 to 3.13 only)
W Waking (Linux 2.6.33 to 3.13 only)
P Parked (Linux 3.9 to 3.13 only)
http://man7.org/linux/man-pages/man5/proc.5.html in the "/proc/[pid]/stat" section:
htop author here. I am not aware of such status code in the htop codebase.
Keep in mind that htop is written for Linux only, so there is no support for macOS/OSX. When I hear of people running it on OSX they are often using an outdated, unsupported fork (the latest version of htop is 2.0.1, including macOS support).
I've got the same question recently. We can try to look it up in the htop sources:
process->state =
ProcessList_decodeState( p->p_stat == SZOMB ? SZOMB : ki->state );
static int
ProcessList_decodeState( int st ) {
switch ( st ) {
case SIDL:
return 'C';
case SRUN:
return 'R';
case SSLEEP:
return 'S';
case SSTOP:
return 'T';
case SZOMB:
return 'Z';
default:
return '?';
}
}
So we go to Unix definition of process states at /usr/include/sys/proc.h:
/* Status values. */
#define SIDL 1 /* Process being created by fork. */
#define SRUN 2 /* Currently runnable. */
#define SSLEEP 3 /* Sleeping on an address. */
#define SSTOP 4 /* Process debugging or suspension. */
#define SZOMB 5 /* Awaiting collection by parent. */
So, 'C' status is meant to be 'Process being created by fork'. What is it? According to old unix sources, it's a transient state that happens when there's not enough memory when forking a process and the parent needs to be swapped out.
What??
Back to htop source. Where do we get the ki->state?
// For all threads in process:
error = thread_info( ki->thread_list[j], THREAD_BASIC_INFO,
( thread_info_t ) & ki->thval[j].tb,
&thread_info_count );
tstate = ProcessList_machStateOrder( ki->thval[j].tb.run_state,
ki->thval[j].tb.sleep_time );
if ( tstate < ki->state )
ki->state = tstate;
// Below...
static int
ProcessList_machStateOrder( int s, long sleep_time ) {
switch ( s ) {
case TH_STATE_RUNNING:
return 1;
case TH_STATE_UNINTERRUPTIBLE:
return 2;
case TH_STATE_WAITING:
return ( sleep_time > 20 ) ? 4 : 3;
case TH_STATE_STOPPED:
return 5;
case TH_STATE_HALTED:
return 6;
default:
return 7;
}
}
// In mach/thread_info.h:
#define TH_STATE_RUNNING 1 /* thread is running normally */
#define TH_STATE_STOPPED 2 /* thread is stopped */
#define TH_STATE_WAITING 3 /* thread is waiting normally */
#define TH_STATE_UNINTERRUPTIBLE 4 /* thread is in an uninterruptible wait */
#define TH_STATE_HALTED 5 /* thread is halted at a clean point */
We have the following (messed up) mapping:
Thread state | Mapped to | htop state| 'top' state | 'ps' state
----------------------------------------------------------------------------
TH_STATE_RUNNING | SIDL(1) | 'C' | running | 'R'
TH_STATE_UNINTERRUPTIBLE | SRUN(2) | 'R' | stuck | 'U'
TH_STATE_WAITING (short) | SSLEEP(3) | 'S' | sleeping | 'S'
TH_STATE_WAITING (long) | SSTOP(4) | 'T' | idle | 'I'
TH_STATE_STOPPED | SZOMB(5) | 'Z' | stopped | 'T'
So, the real answer is: 'C' means that the process is currently running.
How did it happen? Seems that the ki->state handling was borrowed from ps source and wasn't adjusted for Unix process codes.
Update: this bug is fixed. Yay open source!

Interrupt performance on linux kernel with RT patches - should be better?

I have bumped into a bit inconsistent IRQ/ISR performance on Freescales imx.233 running linux kernel (3.8.13) with CONFIG_PREEMPT_RT patches.
I am little bit surprised why this processor (ARM9, 454mhz) is unable to keep up even with 74kHz IRQ requests.. ?
In my kernel config I have set following flags:
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
On the system there is basically nothing running (created by buildroot), and I set PWM to generate a pulse of 74kHz, that serves as interrupt.
Then in the ISR, I just trigger another GPIO output pin, and check the output.
What I find is that sometimes I miss an interrupt -
You can see the missed interrupt here:
And also the the triggering of output pin seems to be a bit inconsistent, the output pin is triggered usually within "5% window", that might still be acceptable. But I worry, that when I start implementing data transfer logic, instead of just triggering the pin, I might run into further problems...
My simple driver code looks like this:
#needed includes
uint16_t INPUT_IRQ = 39;
uint16_t OUTPUT_GPIO = 38;
struct test_device *device;
//Prototypes
void irqtest_exit(void);
int irqtest_init(void);
void free_device(void);
//Default functions
module_init(irqtest_init);
module_exit(irqtest_exit);
//triggering flag
uint16_t pulse = 0x1;
irqreturn_t irq_handle_function(int irq, void *device_id)
{
pulse = !pulse;
gpio_set_value(OUTPUT_GPIO, pulse);
return IRQ_HANDLED;
}
struct test_device {
int huuhaa;
};
void free_device() {
if (device)
kfree(device);
}
int irqtest_init(void) {
int result = 0;
device = kmalloc(sizeof *device, GFP_KERNEL);
device->huuhaa = 10;
printk("IRB/irqtest_init: Inserting IRQ module\n");
printk("IRB/irqtest_init: Requesting GPIO (%d)\n", INPUT_IRQ);
result = gpio_request_one(INPUT_IRQ, GPIOF_IN, "PWM input");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as input.. exiting\n", INPUT_IRQ);
return -EINVAL;
}
result = gpio_request_one(OUTPUT_GPIO, GPIOF_OUT_INIT_LOW , "IR OUTPUT");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as output.. exiting\n", OUTPUT_GPIO);
return -EINVAL;
}
//Set our desired interrupt line as input
result = gpio_direction_input(INPUT_IRQ);
if (result != 0) {
printk("IRB/irqtest_init: Failed to set IRQ as input.. exiting\n");
free_device();
return -EINVAL;
}
//Set flags for our interrupt, guessing here..
irq_flags |= IRQF_NO_THREAD;
irq_flags |= IRQF_NOBALANCING;
irq_flags |= IRQF_TRIGGER_RISING;
irq_flags |= IRQF_NO_SOFTIRQ_CALL;
//register interrupt
result = request_irq(gpio_to_irq(INPUT_IRQ), irq_handle_function, irq_flags, "irq testing", device);
if (result != 0) {
printk("IRB/irqtest_init: Failed to reserve GPIO 38\n");
return -EINVAL;
}
printk("IRB/irqtest_init: insert success\n");
return 0;
}
void irqtest_exit(void) {
if (device)
kfree(device);
gpio_free(INPUT_IRQ);
gpio_free(OUTPUT_GPIO);
printk("IRB/irqtest_exit: Removing irqtest module\n");
}
int irqtest_open(struct inode *inode, struct file *filp) {return 0;}
int irqtest_release(struct inode *inode, struct file *filp) {return 0;}
In the system, I have following interrupts registered, after the driver is loaded:
# cat /proc/interrupts
CPU0
16: 36379 - MXS Timer Tick
17: 0 - mxs-spi
18: 2103 - mxs-dma
60: 0 gpio-mxs irq testing
118: 0 - mxs-spi
119: 0 - mxs-dma
120: 0 - RTC alarm
124: 0 - 8006c000.serial
127: 68050 - uart-pl011
128: 151 - ci13xxx_imx
Err: 0
I wonder if the flags I declare to my IRQ are good ? I noticed that with this configuration, I can no longer reach console, so kernel seems totally consumed with servicing this 74kHz trigger now.. this can't be right ?
I suppose it's not a big deal for me since this is only during data transfer, but still I feel I'm doing something wrong..
Also, I wonder if it would be more efficient to map the registers with ioremap, and trigger the output with direct memory writes ?
Is there some way I could increase the priority of the interrupt even higher ? Or could I somehow lock the kernel for the duration of the data transfer (~400ms), and generate somehow else my timing for the output ?
Edit: Forgot to add /proc/interrupts output to the question...
What you experience here is interrupt jitter. This is to be expected on Linux, because the kernel regularly disables the interrupts for various tasks (entering a spinlock, handling an interrupt, etc.).
This will happen, regardless wether you have PREEMPT_RT or not, so expecting to generate 74kHz signal with regular interrupts is pretty much unrealistic.
Now, ARM has higher priority interrupts called FIQs, that will never be masked or disabled.
Linux doesn't use FIQ, and is not built to deal with the fact that an FIQ could be used, so you won't be able to use the generic kernel framework.
From Linux driver development point of view however, it's not really different as long as you keep this in mind: you have to write a handler, and associate it to an IRQ. You'll also have to poke into the interrupt controller to make it generate a FIQ for the interrupt you want to use (the details on how to change it are platform-dependant. Some platforms have functions to do that (like imx25 and mxc_set_irq_fiq), some others don't. imx23/28 don't, so you'll have to do it by hand).
The only thing that the functions to setup a fiq handler only work with a assembly-written handler, so you'll have to rewrite your handler in assembly (with your current code, it should be trivial though).
You can grab additional details to the blog post Alexandre posted (http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/), where you'll find working code, samples, and explanations on how it all works together.
You can have a look at what my colleague Maxime Ripard did using an FIQ on a similar SoC (i.mx28) :
http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/
Try this flags:
int irq_flags;
...
irq_flags = IRQF_TRIGGER_RISING | IRQF_EARLY_RESUME
I had a kernel 3.8.11 and can't find IRQF_NO_SOFTIRQ_CALL define. It's only for 3.8.13?
Also I didn't see irq_flags define. Where is it?

Multithreaded synchronization primitive

I have the following scenario:
I have multiple worker threads running that all go through a certain section of code, and they're allowed to do so simultaneously. No critical section surrounds this piece of code right now as it's not required for these threads.
I have a main thread that also -occassionally- wants to enter that section of code, but when it does, none of the other worker threads should use that section of code.
Naive solution: surround the section of code with a critical section. But that would kill a lot of parallelism between the worker threads, which is important in my case.
Is there a better solution?
Use RW locks. RW locks allow multiple readers and only a single writer. Your workers would call read-lock at the start of the critical section and the main thread would write-lock.
By definition, when calling read-lock, the calling process will wait for any writing threads to finish. When calling write-lock, the calling process will wait for any reading or writing threads to finish.
Example using POSIX threads:
pthread_rwlock_t lock;
/* worker threads */
void *do_work(void *args) {
for (int i = 0; i < 100; ++i) {
pthread_rwlock_rdlock(&lock);
// do some work...
pthread_rwlock_unlock(&lock);
sleep(1);
}
pthread_exit(0);
}
/* main thread */
int main(void) {
pthread_t workers[4];
pthread_rwlock_init(&lock);
int i;
// spawn workers...
for (i = 0; i < 4; ++i) {
pthread_create(workers[i]; NULL, do_worker, NULL);
}
for (i = 0; i < 100, ++i) {
pthread_rwlock_wrlock(&lock);
// do some work...
pthread_rwlock_unlock(&lock);
sleep(1);
}
return 0;
}
As far as I understand it, your worker threads are started asynchronously. So when the main thread wants to run this code section, you have to ensure that no worker thread is executing it. Therefore you have to stop all worker threads before the main thread can enter that code section, and allow them to enter it again afterwards.
This could be done - using Grand Central Dispatch - if your worker threads would be assigned to a dispatch group, see https://developer.apple.com/library/mac/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.
The main thread could then send the message dispatch_group_wait to this dispatch group, wait for all worker thread to leave this code section, execute it, and then requeue the worker threads.

Resources