Linux PCIe driver and app showing high CPU Usage - linux-kernel

I have a custom Xilinx PCIe endpoint device. I've written a Linux driver for it and a sample app to test it.
The driver loads correctly and the device is recognized. CPU usage is stable until I run my application.
When I run my application, one of my 4 cores hits 100% while the other cores stay below 10%. If I then open any other application (Firefox, in my case), the system hangs completely and requires a hard restart to get back to normal.
Per-process CPU usage shows only my application, at 25%, while everything else stays at 0-1%.
The driver and the application communicate only through interrupts. When an MSI interrupt arrives, the blocking read call on the device file in the application returns, and the application goes back to waiting for the next interrupt. I also access the BAR regions from the application using the resource files.
Why does only one core show 100% CPU usage? And why does my system hang completely when another application is started, even though 3 cores are almost completely free?
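For reference, the wait loop looks roughly like this (the device node name and the read payload here are simplified placeholders, not my actual code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/xilinx_pcie0", O_RDONLY);   /* placeholder node name */
        if (fd < 0) { perror("open"); return 1; }

        unsigned int event;
        for (;;) {
            /* read() blocks in the driver until the next MSI interrupt */
            if (read(fd, &event, sizeof(event)) < 0) {
                perror("read");
                break;
            }
            /* ... handle the interrupt event, then loop to wait again ... */
        }
        close(fd);
        return 0;
    }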

Found the issue.
In the app I run, 4 threads are created, one to handle each PCIe interrupt. Apart from that, the main function polls a global variable in an empty while loop. This was the reason for the high CPU usage. I replaced the busy-wait loop with a usleep-based poll and that fixed it.
CPU usage is less than 20% now.
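A minimal sketch of the change (the flag name and the 1 ms poll interval are illustrative, not the exact values from my app):

    #include <stdatomic.h>
    #include <unistd.h>      /* usleep() */

    static atomic_int done = 0;   /* set by one of the interrupt threads */

    int main(void)
    {
        /* Before: while (!done) { }  -- an empty busy-wait loop that
         * pinned one core at 100%.
         * After: sleep between checks so the core stays mostly idle. */
        while (!atomic_load(&done))
            usleep(1000);         /* yield the CPU for ~1 ms per poll */

        return 0;
    }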
Thanks for your comments.

Related

How is the instruction memory initialized?

In my book, in the chapter where they create the CPU (chapter 7), they already assume that the instruction memory contains the instructions in machine code.
In an earlier chapter (chapter 6) this is written about start-up:
On start-up, the processor jumps to the reset vector and begins
executing boot loader code in supervisor mode. The boot loader
typically configures the memory system, initializes the stack pointer,
and reads the OS from disk; then it begins a much longer boot process
in the OS. The OS eventually will load a program, change to
unprivileged user mode, and jump to the start of the program.
But from what I understand, the reset vector and the boot loader code must already be in memory. Is this correct? Has my book skipped the part before the CPU jumps to the reset vector, and not explained how the reset vector and bootloader get into memory? How does the CPU get them into memory?
All CPUs have a fixed start address. This is set in hardware (maybe you can configure it through jumpers but that's it, because the CPU has to start somewhere).
The first instructions are likewise fixed in hardware, at that fixed address, usually in a hard-coded memory such as flash. There is typically a piece of hardware that translates accesses to that address range into flash (NAND) accesses, which means the flash, even though it's not part of the CPU, is memory mapped.
Some processors then do a memory remap, meaning those addresses become accessible for other things, since you likely don't need the first-stage bootloader anymore.
We can explore further by taking as an example the STM32 boot process:
Configurable boot mode through physical pins and jumpers:
This means that the CPU can start fetching instructions at startup from different locations, defined by those pins.
Factory bootloader:
The bootloader is stored in the internal boot ROM (system memory part of the flash) of any STM32 device, and is programmed by ST during
production. Its main task is to download the application program to the internal flash memory through one of the available serial peripherals, such as USART, CAN, USB, I2C, or SPI.
So this means that if the factory bootloader is selected, the CPU will start executing a program that, by means of the selected communication protocol (USART, CAN, etc.), can fetch a program from another device. This is useful if you have another processor that needs to program your device once it is already mounted on the PCB.
Another option - Write directly to the internal flash
Another option is to select the internal flash. Since this is persistent memory, it can be programmed externally, and when the CPU starts it will find the first instruction to execute at 0x8000000. The last section of the page I linked explains the boot process.
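To make the "first instructions at a fixed address" idea concrete, here is a minimal sketch of what sits at the start of flash on a Cortex-M part such as an STM32. The symbol names and the two-entry table are illustrative; a real startup file defines the full interrupt vector table, and the linker script places the .isr_vector section at 0x8000000:

    /* Minimal Cortex-M style vector table sketch. On reset the hardware
     * loads the stack pointer from word 0 and jumps to the address in
     * word 1 -- no software has to "load" anything first. */
    typedef void (*vector_t)(void);

    extern unsigned int _estack;   /* top of RAM, defined in the linker script */
    void Reset_Handler(void);

    __attribute__((section(".isr_vector")))
    const vector_t vector_table[] = {
        (vector_t)&_estack,   /* word 0: initial stack pointer value */
        Reset_Handler,        /* word 1: reset vector */
    };

    void Reset_Handler(void)
    {
        /* A real startup routine copies .data from flash to RAM,
         * zeroes .bss, and then calls main(). */
        for (;;) { }
    }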

High bandwidth Networking and the Windows "System Interrupts" Process

I am writing a massive UDP network application, running traffic at 10 gigabits per second.
I see very high "System Interrupts" CPU usage in Task Manager.
Reading about what this means, I see:
What Is the “System Interrupts” Process?
System Interrupts is an official part of Windows and, while it does
appear as a process in Task Manager, it’s not really a process in the
traditional sense. Rather, it’s an aggregate placeholder used to
display the system resources used by all the hardware interrupts
happening on your PC.
Most articles say that a high value corresponds to failing hardware.
However, since the "System Interrupts" entry correlates with high IRQ load, maybe it should be high given my heavy UDP network usage.
Also, is all of this really happening on one CPU core? Or is it an aggregate of everything happening across all CPU cores?
If you have many individual datagrams being sent over UDP, it's certainly going to cause a lot of hardware interrupts and a lot of CPU usage. 10 Gb/s is certainly in the range of "lots of CPU" if your datagrams are relatively small.
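As a rough, illustrative calculation (the datagram sizes here are assumptions): at 10 Gb/s with 512-byte datagrams, that is 10×10^9 / (512 × 8) ≈ 2.4 million datagrams per second, i.e. potentially millions of interrupt events per second unless the NIC coalesces them. With ~8 KB datagrams, the same bandwidth needs only about 150,000 datagrams per second.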
Each CPU has its own hardware interrupts. You can see how spread out the load is over cores on the performance tab - the red line is the kernel CPU time, which includes hardware interrupts and other low-level socket handling by the OS.

Force windows onto one CPU, and then take over the rest

I've seen various RTOSes that use this strategy: they let Windows boot on one or more CPUs and then run realtime programs on the rest. Any idea how this might be accomplished? Can I let the computer boot off two CPUs and then stop execution on the rest of them? What documentation should I start looking at? I have enough experience with the Linux kernel that I might be able to figure out how to do it under Linux, so if there's anything analogous in Linux that you could describe it in terms of, that would be fantastic.
You can boot Windows on fewer CPUs than available easily. Run msconfig.exe, go to the Boot tab, click the Advanced options... button, check the number of processors box and set the desired number (this is for Windows 7, the exact location for Vista and XP might differ slightly).
But that's just a solution to a very small part of the problem.
You will need to implement a special kernel-mode driver to start those other CPUs (Windows won't let you do that sort of thing from non-kernel-mode code). You will also need to implement a thread scheduler for those CPUs and a bunch of other low-level things. You might want to steal some physical memory (RAM) from Windows and implement a memory manager as well, and those two alone can be very involved.
What to read? The Intel/AMD CPU documentation (specifically the APIC part), the x86 Multiprocessor specification from Intel, books on Windows drivers, Windows Internals books, MSDN, etc.
You can't turn off Windows on one CPU and expect to run your program as usual, because syscalls are serviced on the same CPU as the thread that issues them. Syscall handling relies on kernel-mode-accessible per-thread data, and hence any thread (user-mode or kernel-mode) can only run once Windows has performed its per-core initialization of that CPU.
It seems likely that you're writing a super-double-mega-awesome app that really-definitely needs to run, like, super-fast and you want everyone else to get off the core, 'cos then, like, you'll be the totally fastest-est, but you're not really appreciating that if Windows isn't on your core, then you can't use ANY part of Windows on that core either.
If you really do want to do this, you'll have to run as a boot-driver. The boot-driver will be able to reserve one of the cores from being initialized during boot, preventing Windows from "seeing" that core. You can then manually construct your own thread of execution to run on that core, but you'll need to handle paging, memory allocation, scheduling, NUMA, NMI exceptions, page-faulting, and ACPI events yourself. You won't be able to call Windows from that core without bluescreening Windows. You'll be on your own.
What you probably want to do is lock your thread to a single processor (via SetThreadAffinityMask) and then raise your thread's priority to the maximum value. When you do so, Windows still runs on your core to service things like page faults and hardware interrupts, but no lower-priority user-mode thread will run on that core (they'll all move to other cores unless they are also locked to your processor).
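A minimal sketch of that approach (the choice of core 3 is arbitrary, for illustration only):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE th = GetCurrentThread();

        /* Pin this thread to core 3 (bit 3 of the affinity mask). */
        if (SetThreadAffinityMask(th, 1ULL << 3) == 0) {
            fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }

        /* Raise priority; REALTIME_PRIORITY_CLASS plus
         * THREAD_PRIORITY_TIME_CRITICAL is as high as user mode goes. */
        SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
        SetThreadPriority(th, THREAD_PRIORITY_TIME_CRITICAL);

        /* ... time-critical work here ... */
        return 0;
    }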
I could not understand the question properly, but if you're asking about scheduling processes onto specific cores, Linux can accomplish this using CPU affinity. See this page:
http://www.kernel.org/doc/man-pages/online/pages/man2/sched_setaffinity.2.html
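For example, a minimal sketch that pins the calling process to CPU 0 (the CPU number is arbitrary):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* allow only CPU 0 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... from here on, this process runs only on CPU 0 ... */
        return 0;
    }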

MS-Windows scheduler control (or otherwise) -- test application performance on slower CPU?

Is there some tool which allows one to control the MS-Windows (XP SP3 32-bit, in my case) scheduler, such that a target application (which I'd like to test) operates as if it were running on a slower CPU? Say my physical host is a 2.4GHz dual-core, but I'd like the application to run as if it were on an 800MHz/1.0GHz CPU.
I am aware of programs which let old DOS games run slower, but AFAIK they take the approach of consuming CPU cycles to starve the application. I do not want that, and I would also like higher-precision control over the clock.
I don't believe you'll find software that directly emulates different CPUs, but something like ProcessLasso would let you control a program's CPU usage, thus simulating, in a way, a slower clock speed.
I also found this blog entry with many other ways to throttle your CPU: Windows CPU throttling techniques
Additionally, if you have access to VMWare you could set up a resource pool with a limited CPU reservation.

Emulating a processor's (limited) resources, including clock speed

I would like a software environment in which I can test the speed of my software on hardware with specific resources. For example, how fast does this program run on an 800MHz x86 with 24 MB of RAM, when my host hardware is a 3GHz quad-core amd64 with 12GB of RAM? Emulators such as qemu make a great point of running "almost as fast" as the underlying hardware; I would like to make them run slower. Is there a way to do that?
I have never tried it, but perhaps you could achieve what you want to some extent by combining an emulator like QEMU or VirtualBox on Linux with something like this:
http://cpulimit.sourceforge.net/
If you can limit the CPU time available to the emulator you might be able to simulate the results of execution on a slower computer. Keep in mind, though, that this would only affect the execution speed (or so I hope, anyway).
The CPU instruction set and other system features would remain unchanged. This means that emulating a specific processor accurately would be difficult if not impossible.
In addition, using something like cpulimit, which works by sending SIGSTOP and SIGCONT to repeatedly stop and restart the emulator process, might cause side effects such as timing inconsistencies, video display artifacts, etc.
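For a sense of the mechanism, here is a sketch of the duty-cycling that cpulimit performs (the 50 ms period and 50% duty cycle are illustrative):

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        pid_t pid = (pid_t)atoi(argv[1]);   /* PID of the process to throttle */

        for (;;) {
            kill(pid, SIGCONT);   /* let the target run... */
            usleep(50000);        /* ...for 50 ms */
            kill(pid, SIGSTOP);   /* freeze it... */
            usleep(50000);        /* ...for 50 ms: roughly 50% of one core */
        }
    }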
In your emulator, keep a virtual "clock" and increment it appropriately as you execute each instruction. From there you can simply report how long it took in virtual time to execute, or you can have your emulator sleep now and again to keep execution speed roughly where it would be in the target.
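A minimal sketch of that virtual-clock idea (TARGET_HZ, the batch size, and the fetch/execute placeholder are assumptions, not from any particular emulator):

    #include <stdint.h>
    #include <time.h>

    #define TARGET_HZ 800000000ULL   /* pretend the target runs at 800 MHz */
    #define BATCH     100000         /* re-sync with real time every N cycles */

    static uint64_t virtual_cycles = 0;

    /* placeholder for the real fetch/decode/execute step; returns the
     * number of target cycles the emulated instruction consumed */
    static unsigned execute_one_instruction(void) { return 1; }

    int main(void)
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (;;) {
            virtual_cycles += execute_one_instruction();

            if (virtual_cycles % BATCH == 0) {
                clock_gettime(CLOCK_MONOTONIC, &now);
                double real = (now.tv_sec - start.tv_sec)
                            + (now.tv_nsec - start.tv_nsec) / 1e9;
                double virt = (double)virtual_cycles / TARGET_HZ;

                if (virt > real) {   /* ahead of the target: sleep off the lead */
                    struct timespec ts;
                    ts.tv_sec  = (time_t)(virt - real);
                    ts.tv_nsec = (long)(((virt - real) - ts.tv_sec) * 1e9);
                    nanosleep(&ts, NULL);
                }
            }
        }
    }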
