why does the linux kernel thread hog up cpu - linux-kernel

I have created a kernel thread using kthread_run in a kernel module.
The thread is very simple, just like bellow.
static int my_thread_func(void * data)
{
int a;
DBG_PRINT("policy:%lu; prio:%d", current->policy, current->prio);
while (!kthread_should_stop())
{
a++;
}
}
However, after I loaded the module, the system did not response any more.
So I wonder what's the schedule policy and priority of this kernel thread.
Then I try to print out the schedule policy and priority of this kernel thread,
and got bellow output.
policy:0; prio:120
policy:0 means SCHED_NORMAL;
prio:120 this is also not high.
While the thread does not have a SCHED_FIFO or SCHED_RR schedule policy, Why can it hog up the cpu?
And I also found that if I insert some sleep code in the loop body of the thread, the system could remain responsive.
And I also found when I run a userspace program implemented as bellow, the system remained responsive, too.
int main(int argc, char *argv[])
{
int a;
while (1) a++;
return 0;
}
So who can tell me, why the kernel thread could hog up the cpu.

When you say the priority is 120, are you observing the priority of kthreadd, or the actual kernel thread that was created for you?
Please see http://lxr.free-electrons.com/source/kernel/kthread.c#L310 for the function that is used to create new kernel threads.
Excerpt:
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
...)
{
...
if (!IS_ERR(task)) {
static const struct sched_param param = { .sched_priority = 0 };
...
/*
* root may have changed our (kthreadd's) priority or CPU mask.
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
set_cpus_allowed_ptr(task, cpu_all_mask);
}
...
}
It appears that the sched_priority is set to 0.
Now, please look at http://lxr.free-electrons.com/source/include/linux/sched/prio.h#L9
Excerpt:
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
* tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
* values are inverted: lower p->prio value means higher priority.
*
* The MAX_USER_RT_PRIO value allows the actual maximum
* RT priority to be separate from the value exported to
* user-space. This allows kernel threads to set their
* priority to a value higher than any user task. Note:
* MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
*/
Please note the second paragraph: This allows kernel threads to set their priority to a value higher than any user task.
Summary:
It appears that newly created kernel threads are set to sched_priority of 0 even with SCHED_NORMAL. Priority of 0 is not a normal priority for SCHED_NORMAL, and thus the kernel thread will take priority over any other thread (that isn't using an RT policy---and I don't think any processes in the kernel use RT by default).
Addendum:
Note: I am not 100% sure if this is the reason. But, if you look at the comments in the kernel for kernel threads, they all seem to imply that a kernel thread keeps running UNLESS:
The kernel thread itself yields or calls do_exit.
Someone else calls kthread_should_stop().
Which to me, sounds like the kernel thread runs as long as it wants, until it decides to stop, or someone else explicitly tells it to stop.

Related

Memory loads experience different latency on the same core

I am trying to implement a cache based covert channel in C but noticed something weird. The physical address between the sender and the receiver is shared by using the mmap() call that maps to the same file with the MAP_SHARED option. Below is the code for the sender process which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
CYCLES t1 = rdtscp();
load = *((int *)p);
CYCLES t2 = rdtscp();
return (t2-t1);
}
void send_bit(int one, void *addr) {
if(one) {
clflush((void *)addr);
load__latency = load_latency((void *)addr);
printf("load latency = %d.\n", load__latency);
clflush((void *)addr);
}
else {
x = *((int *)addr);
load__latency = load_latency((void *)addr);
printf("load latency = %d.\n", load__latency);
}
}
int main(int argc, char **argv) {
if(argc == 2)
{
bit = atoi(argv[1]);
}
// transmit bit
init_address(DEFAULT_FILE_NAME);
send_bit(bit, address);
return 0;
}
The load operation takes around 0 - 1000 cycles (during a cache-hit and a cache-miss) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency during a cache-hit or a cache-miss, the code for which has been shown below:
int main(int argc, char **argv) {
init_address(DEFAULT_FILE_NAME);
rdtscp();
load__latency = load_latency((void *)address);
printf("load latency = %d\n", load__latency);
return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very much different as compared to the first case. The load operation takes around 5000-1000 cycles.
Both the processes have been pinned to the same core-id by using the taskset command. So if I'm not wrong, during a cache-hit, both processes will experience a load latency of the L1-cache on a cache-hit and DRAM on a cache-miss. Yet, these two processes experience a very different latency. What could be the reason for this observation, and how can I have both the processes experience the same amount of latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
(Also you don't seem to be doing anything to warm up the CPU frequency, so once core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
It's also strange that you you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on TLB hit for the store, it doesn't have to wait for it commit to cache, since there's no mfence before rdtscp. A store can retire while the store is still in the store buffer. In fact it must retire before the store buffer entry is known to be non-speculative so it can commit.)

Workaround for a certain IOServiceOpen() call requiring root privileges

Background
It is possible to perform a software-controlled disconnection of the power adapter of a Mac laptop by creating an DisableInflow power management assertion.
Code from this answer to an SO question can be used to create said assertion. The following is a working example that creates this assertion until the process is killed:
#include <IOKit/pwr_mgt/IOPMLib.h>
#include <unistd.h>
int main()
{
IOPMAssertionID neverSleep = 0;
IOPMAssertionCreateWithName(kIOPMAssertionTypeDisableInflow,
kIOPMAssertionLevelOn,
CFSTR("disable inflow"),
&neverSleep);
while (1)
{
sleep(1);
}
}
This runs successfully and the power adapter is disconnected by software while the process is running.
What's interesting, though, is that I was able to run this code as a regular user, without root privileges, which wasn't supposed to happen. For instance, note the comment in this file from Apple's open source repositories:
// Disables AC Power Inflow (requires root to initiate)
#define kIOPMAssertionTypeDisableInflow CFSTR("DisableInflow")
#define kIOPMInflowDisableAssertion kIOPMAssertionTypeDisableInflow
I found some code which apparently performs the actual communication with the charger; it can be found here. The following functions, from this file, appears to be of particular interest:
IOReturn
AppleSmartBatteryManagerUserClient::externalMethod(
uint32_t selector,
IOExternalMethodArguments * arguments,
IOExternalMethodDispatch * dispatch __unused,
OSObject * target __unused,
void * reference __unused )
{
if (selector >= kNumBattMethods) {
// Invalid selector
return kIOReturnBadArgument;
}
switch (selector)
{
case kSBInflowDisable:
// 1 scalar in, 1 scalar out
return this->secureInflowDisable((int)arguments->scalarInput[0],
(int *)&arguments->scalarOutput[0]);
break;
// ...
}
// ...
}
IOReturn AppleSmartBatteryManagerUserClient::secureInflowDisable(
int level,
int *return_code)
{
int admin_priv = 0;
IOReturn ret = kIOReturnNotPrivileged;
if( !(level == 0 || level == 1))
{
*return_code = kIOReturnBadArgument;
return kIOReturnSuccess;
}
ret = clientHasPrivilege(fOwningTask, kIOClientPrivilegeAdministrator);
admin_priv = (kIOReturnSuccess == ret);
if(admin_priv && fOwner) {
*return_code = fOwner->disableInflow( level );
return kIOReturnSuccess;
} else {
*return_code = kIOReturnNotPrivileged;
return kIOReturnSuccess;
}
}
Note how, in secureInflowDisable(), root privileges are checked for prior to running the code. Note also this initialization code in the same file, again requiring root privileges, as explicitly pointed out in the comments:
bool AppleSmartBatteryManagerUserClient::initWithTask(task_t owningTask,
void *security_id, UInt32 type, OSDictionary * properties)
{
uint32_t _pid;
/* 1. Only root processes may open a SmartBatteryManagerUserClient.
* 2. Attempts to create exclusive UserClients will fail if an
* exclusive user client is attached.
* 3. Non-exclusive clients will not be able to perform transactions
* while an exclusive client is attached.
* 3a. Only battery firmware updaters should bother being exclusive.
*/
if ( kIOReturnSuccess !=
clientHasPrivilege(owningTask, kIOClientPrivilegeAdministrator))
{
return false;
}
// ...
}
Starting from the code from the same SO question above (the question itself, not the answer), for the sendSmartBatteryCommand() function, I wrote some code that calls the function passing kSBInflowDisable as the selector (the variable which in the code).
Unlike the code using assertions, this one only works as root. If running as a regular user, IOServiceOpen() returns, weirdly enough, kIOReturnBadArgument (not kIOReturnNotPrivileged, as I would have expected). Perhaps this has to do with the initWithTask() method above.
The question
I need to perform a call with a different selector to this same Smart Battery Manager kext. Even so, I can't even get to the IOConnectCallMethod() since IOServiceOpen() fails, presumably because the initWithTask() method prevents any non-root users from opening the service.
The question, therefore, is this: how is IOPMAssertionCreateWithName() capable of creating a DisableInflow assertion without root privileges?
The only possibility I can think of is if there's a root-owned process to which requests are forwarded, and which performs the actual work of calling IOServiceOpen() and later IOConnectCallMethod() as root.
However, I'm hoping there's a different way of calling the Smart Battery Manager kext which doesn't require root (one that doesn't involve the IOServiceOpen() call.) Using IOPMAssertionCreateWithName() itself is not possible in my application, since I need to call a different selector within that kext, not the one that disables inflow.
It's also possible this is in fact a security vulnerability, which Apple will now fix in a future release as soon as it is alerted to this question. That would be too bad, but understandable.
Although running as root is a possibility in macOS, it's obviously desirable to avoid privilege elevation unless absolutely necessary. Also, in the future I'd like to run the same code under iOS, where it's impossible to run anything as root, in my understanding (note this is an app I'm developing for my own personal use; I understand linking to IOKit wipes out any chance of getting the app published in the App Store).

How can I measure how long this Linux interrupt handler takes to run?

I am trying to debug a custom Linux serial driver that is having some issues missing some receive data. It has one interrupt for 4 serial ports, and baud rate is 115200. Firstly I would like to see how to measure how long the interrupt handler takes. I have used perf, but things are just in percent and not seconds. Secondly does anyone see any issues with the below code that can be improved to speed things up?
void serial_interrupt(int irq, void *dev_id)
{
...
// Need to loop through each port to see which port caused the interrupt.
list_for_each(lpNode, &serial_ports)
{
struct serial_port_module *ser_dev = list_entry(lpNode, struct serial_port_module, port_list);
lnIsr = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_ISR);
if (lnIsr & IPM512_RX_INT)
{
while (serialdata_is_data_available(ser_dev)) // equals a ioread8()
{
lcIn = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_RBR);
kfifo_in(&ser_dev->rx_fifo, &lcIn, sizeof(lcIn));
// Notify if anyone is doing a blocking read.
wake_up_interruptible(&ser_dev->read_queue);
}
}
}
}
Use the ftrace API to try to track down your latency issues. It's woth the time to get to know: https://www.kernel.org/doc/Documentation/trace/ftrace.txt
If this is too heavy-weight, what about adding some simple instrumentation yourself? getnstimeofday(struct timespec *ts) is relatively lightweight... with a little code you could output in a sysfs debug file the worst case execution times, some stats on latency of call to this function, worst-case number of bytes available per interrupt... if this number gets near your hardware FIFO size, you're in trouble.
One optimization would be to read the data in batches into a buffer, as long as data is available, then input the entire buffer, then wake up any readers.
while(data_available(dev))
{
buf[cnt++] = ioread8();
}
kfifo_in(fifo, buf, cnt);
wake_up_interruptible();
But execution time of code this simple is not likely to be an issue. You're probably suffering from missed interrupts or unexpected latency of the interrupt handling.

What does virtual core in YARN vcore mean?

Yarn is using the concept of virtual core to manage CPU resources. I would ask what's the benefit to use virtual core, is there some reason here that YARN uses vcore?
Here is what the documentation states (emphasis mine)
A node's capacity should be configured with virtual cores equal to its
number of physical cores. A container should be requested with the
number of cores it can saturate, i.e. the average number of threads it
expects to have runnable at a time.
Unless the CPU core is hyper-threaded it can run only one thread at a time (in case of hyper threaded OS actually sees 2 cores for one physical core and can run two threads - of course it's a bit of cheating and no-where as efficient as having actual physical core). Essentially what it means to end user is that a core can run a single thread so theoretically if I want parallelism using java threads then a reasonably good approximation is number of threads equal to number of core. So if your container process ( which is a JVM)
will require 2 threads then it's better to map it to 2 vcore - that what the last line means. And as total capacity of node the vcore should be equal to number of physical cores.
The most important thing to remember is still that it's actually the OS which will schedule the threads to be executed in different cores as it happens in any other application and
YARN in itself does not have control on it except the fact that what is the best possible approximation for how many thread to allocate for each container. And that's why it is important to take into consideration other applications running on OS, CPU cycles used by kernel etc., as all of cores will not be available to YARN application all the time.
EDIT: Further research
Yarn does not influence hard limits on CPU but Going through the code I can see how it tries to influence the CPU scheduling or cpu rate. Technically Yarn can launch different container processes - java, python , custom shell command etc. The responsibility of launching containers in Yarn belongs to the ContainerExecutor component of Node manager and I can see code for launching the container etc., along with some hints (depending on platform). For example in case of DefaultContainerExecutor ( which extends ContainerExecutor) - for windows it uses "-c" parameter for cpu restriction and on linux it uses process niceness to influence it. There is another implementation LinuxContainerExecutor (or better still CgroupsLCEResourcesHandler as former does not force the usage of cgroups) which tries to use Linux cgroups to limit the Yarn CPU resources on that node. More details can be found here.
ContainerExecutor {
.......
.......
protected String[] getRunCommand(String command, String groupId,
String userName, Path pidFile, Configuration conf, Resource resource) {
boolean containerSchedPriorityIsSet = false;
int containerSchedPriorityAdjustment =
YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY;
if (conf.get(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY) !=
null) {
containerSchedPriorityIsSet = true;
containerSchedPriorityAdjustment = conf
.getInt(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY,
YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY);
}
if (Shell.WINDOWS) {
int cpuRate = -1;
int memory = -1;
if (resource != null) {
if (conf
.getBoolean(
YarnConfiguration.NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED,
YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED)) {
memory = resource.getMemory();
}
if (conf.getBoolean(
YarnConfiguration.NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED,
YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED)) {
int containerVCores = resource.getVirtualCores();
int nodeVCores = conf.getInt(YarnConfiguration.NM_VCORES,
YarnConfiguration.DEFAULT_NM_VCORES);
// cap overall usage to the number of cores allocated to YARN
int nodeCpuPercentage = Math
.min(
conf.getInt(
YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT,
YarnConfiguration.DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT),
100);
nodeCpuPercentage = Math.max(0, nodeCpuPercentage);
if (nodeCpuPercentage == 0) {
String message = "Illegal value for "
+ YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT
+ ". Value cannot be less than or equal to 0.";
throw new IllegalArgumentException(message);
}
float yarnVCores = (nodeCpuPercentage * nodeVCores) / 100.0f;
// CPU should be set to a percentage * 100, e.g. 20% cpu rate limit
// should be set as 20 * 100. The following setting is equal to:
// 100 * (100 * (vcores / Total # of cores allocated to YARN))
cpuRate = Math.min(10000,
(int) ((containerVCores * 10000) / yarnVCores));
}
}
return new String[] { Shell.WINUTILS, "task", "create", "-m",
String.valueOf(memory), "-c", String.valueOf(cpuRate), groupId,
"cmd /c " + command };
} else {
List<String> retCommand = new ArrayList<String>();
if (containerSchedPriorityIsSet) {
retCommand.addAll(Arrays.asList("nice", "-n",
Integer.toString(containerSchedPriorityAdjustment)));
}
retCommand.addAll(Arrays.asList("bash", command));
return retCommand.toArray(new String[retCommand.size()]);
}
}
}
For windows (it utilizes winutils.exe) , it uses cpu rate
For Linux it uses niceness as a parameter to control the CPU priority
"Virtual cores" are merely an abstraction of actual cores. This abstraction or "lie" (as i like to call it), allows YARN (and others) to dynamically spin threads (parallel process) based on availability. Take for example running map reduce on an "elastic" cluster with a processing limit constrained only by your wallet... The cloud baby... The. Cloud.
you can read more here

Condition Variable alternatives (c/c++ on windows xp)

I want to write a thread which runs tasks from an unlimited-size container of tasks.
While the task-list is empty the thread trying to get a task should be blocked.
Coming from Linux I wanted to use condition variable which will be signaled on task adding and will be waited while the list is empty.
I found that CONDITION_VARIABLE is available only from windows Vista, so this is out of question.
Semaphores are problematic too due to the unlimited-size restriction.
Is there any apropriate subtitution?
Thanks
Why do you say that semaphores are problematic? Linux/Windows both have semaphores with a maximum count that can be realistically be described as 'Unlimited'.
Use James' suggestion on Windows - it will work fine. Init. your semaphore with zero count. Add a task to your big (thread-safe), container, then signal the semaphore. In the thread, wait on the semaphore, then get a task from your container and process it. You can pass the semaphore instance to multiple threads if you wish - that will work OK as well.
Rgds,
Martin
Sounds like you want a Win32 kernel event. See CreateEvent.
WaitForSingleObject and CreateSemaphore?
Thanks all,
thats my conclusion:
void ThreadPool::ThreadStartPoint(ThreadPool* tp)
{
while (1)
{
WaitForSingleObject(tp->m_taskCountSemaphore,INFINITE); // while (num of tasks==0) block; decreament num of tasks
BaseTask* current_task = 0;
// get top priority task
EnterCriticalSection (&tp->m_mutex);
{
current_task = tp->m_tasksQue.top();
tp->m_tasksQue.pop();
}
LeaveCriticalSection (&tp->m_mutex);
current_task->operator()(); // this is not critical section
current_task->PostExec();
}
}
void ThreadPool::AddTask(BaseTask& _task)
{
EnterCriticalSection (&m_mutex);
{
m_tasksQue.push(&_task);
_task.PrepareTask(m_mutex);
}
LeaveCriticalSection (&m_mutex);
if (!ReleaseSemaphore(m_taskCountSemaphore,
1, // increament num of tasks by 1
NULL // don't store previuos num of tasks value
))
{//if failed
throw ("semaphore release failed");
}
}

Resources