Profiling OpenMP: update_cfs_rq_blocked_load takes majority of runtime - linux-kernel

I am profiling some OpenMP code with Intel's VTune Amplifier (amplxe). It's giving me some confusing results, and I am not sure whether something weird is going on with the profiler.
I use __itt_pause() and __itt_resume() to restrict collection to the OpenMP section of the code, but my bottom-up view looks like this:
Function                    Module   CPU Time
_raw_spin_lock              vmlinux  23.474s
update_cfs_rq_blocked_load  vmlinux  15.215s
__schedule                  vmlinux  10.695s
put_prev_task_fair          vmlinux   7.918s
update_cfs_shares           vmlinux   7.447s
Could this be the effect of an overparallelised OpenMP region, or is it something weird in what the profiler is reporting? I was expecting to see more OpenMP-related information in the results, but those entries are all at the bottom, and the runtime is dominated by kernel routines.
Could someone explain whether these are things one would typically see in OpenMP profiles, and what exactly the CFS (Completely Fair Scheduler) functions do? (My understanding so far: there's a lot of locking going on in the scheduler!?)
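For reference, the instrumentation described above looks roughly like this - a minimal sketch assuming VTune's ittnotify API and a collection started paused (e.g. with amplxe-cl's -start-paused option); the loop body is a placeholder:

#include <ittnotify.h>  // Intel ITT API; compile with -fopenmp, link with -littnotify

int main() {
    __itt_pause();               // keep setup out of the collection
    // ... allocate and initialise data ...

    __itt_resume();              // profile only the parallel region
    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i) {
        // ... per-iteration work ...
    }
    __itt_pause();               // stop collecting before teardown
    return 0;
}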

Related

When we get a runtime error in a Swift project, why does Xcode send us to the thread output in assembly language? What's the point?

As you know, when something goes wrong while running a Swift project in Xcode, we are directed to the thread section of the Debug Navigator and faced with some assembly code like this:
I am wondering: is there any reference, tutorial, or tool for understanding this code? There must be a reason we are directed to it.
Let me be clear: I know how to fix the errors, but it bothers me when I don't understand something like this. I want to know what this code is and how we can use it, or at least understand it.
Thanks :)
Original question: what language is that? That's AT&T syntax assembly language for x86-64. https://stackoverflow.com/tags/x86/info for manuals from Intel and other resources, and https://stackoverflow.com/tags/att/info for how AT&T syntax differs from Intel syntax used in most manuals. (I think the x86 tag wiki has a few AT&T syntax tutorials.) Most AT&T-syntax disassemblers have an intel-syntax mode, too, so you can use that if you want asm that matches Intel's manuals.
What's the point?
The point is so you can debug your program if you know asm. Or you can show the asm to someone who does understand it, or include it in a bug report.
Did you compile without debug symbols? Or did it crash in library code without symbols? It's normal for a debugger to show you asm if it can't show you source, or if you ask for asm.
If you have debug symbols for your own code, you can at least backtrace into parent functions for which you do have source. (Unless the stack is corrupted.)
Did your program fault on that instruction highlighted in pink? That's a bit odd, since it's loading from static data (a RIP-relative load means the address is a link-time constant).
Did you maybe munmap or mprotect that page of your program's data or text segment so a load would fault? Normally you only get faults when an addressing mode involves a pointer.
(The call *0x1234(%rip) right before it is calling through a function pointer, though. The function pointer is stored in memory, and code fetch after the call executes would fault if it pointed to an unmapped or non-executable page.) But your first image shows you got a SIGABRT, not a SIGSEGV, so it looks more like the program aborted on purpose after failing an assertion.
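For illustration, this kind of C++ (a hypothetical reconstruction, not the asker's code) compiles to that pattern:

#include <cstdio>

void greet() { std::puts("hello"); }

// A function pointer stored in static data.
void (*handler)() = greet;

void dispatch() {
    handler();  // at -O1 and above this can compile to: call *handler(%rip)
}

int main() {
    dispatch();                        // fine: handler points at valid code
    // handler = nullptr; dispatch();  // would die with SIGSEGV, not SIGABRT
}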
I believe the majority of Swift coders don't know asm.
There's nothing more useful a debugger can do without debug symbols and source files.
Also keep in mind that the majority of debugger authors do know asm, so for them it is an obviously-useful feature / behaviour. They know that many people won't be able to benefit from it, but that some will.
Asm is what's really running on the machine. Without asm, you couldn't find wrong-code compiler bugs, etc. etc. As far as software bugs, there is no lower level than asm, so it's not some arbitrary choice of some lower-level layer to stop at.
(Unless there's also a bug in your disassembler or debugger, in which case you need to check the hex machine code.)

How to de-optimize the Linux kernel to avoid value optimized out

I am debugging the Linux kernel.
I compile the kernel with the -O1 optimization level. (Note that the Linux kernel cannot be compiled with -O0.)
When using GDB to debug it, I found that some values are optimized out: the len, flags, and add_len arguments all show as <optimized out>.
How can I de-optimize the Linux kernel so that these variables are not optimized out?
Building with -Og is supposed to eliminate these problems, though I don't know whether the Linux kernel can be compiled that way.
Note that often you can discover the "optimized out" value by going up or down the stack, e.g. if the caller looks like this:
udp_recvmsg(sk, foo->msg, foo->msglen, ...);
then looking at *foo in the caller will tell you the len despite it being optimized out in udp_recvmsg itself.
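In a GDB session that recovery might look something like this (frame layout hypothetical):

(gdb) up                  # move to the caller's frame
(gdb) print foo->msglen   # the len that showed as <optimized out> below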

Runtime system for Stm32F103 Arm, GNAT Ada compiler

I'd like to use Ada with the STM32F103 MCU, but here is the problem: there is no built-in runtime system for it in GNAT 2016. There is an RTS included for another Cortex-M3 MCU by TI (zfp-lm3s), but it seems to need more than trivial changes; simply changing the memory size/origin doesn't work.
So, there are some questions:
Does somebody have an RTS for the STM32F103?
Are there any good books about the low-level details of the Cortex-M3 or other ARM MCUs?
P.S. Using zfp-lm3s raises this error when I try to run the program via GPS:
Loading section .text, size 0x140 lma 0x0
Load failed
The STM32F series is from STMicroelectronics, not TI, so the stm32f4 runtime might seem a better starting point.
In particular, the clock code in bsp/setup_pll.adb should need only minor tweaking; use STM’s STM32CubeMX tool (written in Java) to find the magic numbers to set up the clock properly.
You will also find that the assembler code used in bsp/start*.S needs simplifying/porting to the Cortex-M3 part.
My Cortex GNAT Run Time Systems project includes an Arduino Due version (also Cortex-M3), which has startup code written entirely in Ada. I don’t suppose the rest of the code would help a lot, being based on FreeRTOS - you’d have to be very very careful about memory usage.
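For a sense of what those magic numbers amount to, here is a minimal C++ sketch of the canonical 8 MHz HSE to 72 MHz PLL bring-up on the F103 (register layout per ST's RM0008 reference manual; a runtime port would make the equivalent writes from Ada in setup_pll.adb):

#include <cstdint>

// MMIO helper; addresses per RM0008 (STM32F103 reference manual).
static volatile uint32_t& reg(uintptr_t addr) {
    return *reinterpret_cast<volatile uint32_t*>(addr);
}
constexpr uintptr_t RCC_CR    = 0x40021000;  // clock control register
constexpr uintptr_t RCC_CFGR  = 0x40021004;  // clock configuration register
constexpr uintptr_t FLASH_ACR = 0x40022000;  // flash access control register

void clock_setup_72mhz() {
    reg(FLASH_ACR) = (1u << 4) | 2u;          // prefetch on, 2 wait states for 72 MHz

    reg(RCC_CR) |= (1u << 16);                // HSEON: start the 8 MHz crystal
    while (!(reg(RCC_CR) & (1u << 17))) {}    // wait for HSERDY

    // PLL source = HSE, multiplier = x9 (8 x 9 = 72 MHz), APB1 = /2 (36 MHz cap)
    reg(RCC_CFGR) = (1u << 16) | (7u << 18) | (4u << 8);

    reg(RCC_CR) |= (1u << 24);                // PLLON
    while (!(reg(RCC_CR) & (1u << 25))) {}    // wait for PLLRDY

    reg(RCC_CFGR) |= 2u;                      // SW = PLL: switch SYSCLK over
    while ((reg(RCC_CFGR) & 0xCu) != 0x8u) {} // wait until SWS confirms the switch
}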
I stumbled upon this question while looking for a zfp runtime specific to the stm32l0xx boards. It doesn't look like one exists from what I can see, but I did stumble upon this guide to creating a new runtime from AdaCore, which might help anyone stuck with the same issue:
https://blog.adacore.com/porting-the-ada-runtime-to-a-new-arm-board

What's the difference between _int_malloc and malloc (in Valgrind)

I am amazed that I can't find any document stating the difference between _int_malloc and malloc in the output of Valgrind's callgrind tool.
Could anybody explain the difference between them?
Furthermore, I actually write C++ code, so I use new exclusively, not malloc, yet only malloc shows up in the callgrind output.
The malloc listed in the callgrind output will be the implementation of malloc provided by the glibc function __libc_malloc in the file glibc/malloc/malloc.c.
This function calls another function, intended for internal use only, named _int_malloc, which does most of the hard work.
As writing standard libraries is very difficult, the authors must be very good programmers and therefore very lazy. So, instead of writing memory allocation code twice, the new operator calls malloc in order to get the memory it requires.
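You can see this with a toy program: run it under valgrind --tool=callgrind and the allocation cost shows up under malloc (with _int_malloc below it), even though the source only uses new. A sketch:

int main() {
    int* p = new int[1000000];  // operator new[] -> operator new -> malloc
    delete[] p;                 // operator delete[] -> free
    return 0;
}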

How can I get a list of legal ARM opcodes from gcc (or elsewhere)?

I'd like to generate pseudo-random ARM instructions. Via assembler directives, I can tell gcc what mode I'm in, and it will complain if I try a set of opcodes and operands that's not legal in that mode, so it must have some internal listing of what can be done in which mode. Where does that live? Would it be easier to extract that info from LLVM?
Is this question "not even wrong"? Should I try a different approach entirely?
To answer my own question: this is actually really easy to do from arm.md and constraints.md in gcc/config/arm/. I probably spent more time asking this question and answering comments on it than I did figuring this out. Turns out I just need to look for 'TARGET_THUMB1', until I get around to implementing Thumb-2.
For the ARM family the buck stops at the ARM ARM (ARM Architecture Reference Manual). There is an ARM instruction set section and a Thumb instruction set section. Within both, each instruction tells you which architecture generation it belongs to (ARMvX, where X is some number like 4 (ARM7) or 5 (ARM9 time frame), etc.). Since the opcode and pseudocode are listed for each instruction, you should be able to figure out which are real instructions and which, if any, are assembler shorthands for another instruction (push and pop, for example).
With the Cortex-M3 and Thumb-2 in particular you also need to look at the TRM (Technical Reference Manual) as well. ARM has a universal syntax (UAL, the Unified Assembler Language) that they are trying to use, which should work for both Thumb and ARM. For example, on ARM you have three-register instructions:
add r1,r1,r2
In Thumb there are only two-register operations:
add r1,r2
The desire is basically to meet in the middle, or, more accurately, to encourage assemblers to accept the ARM-style syntax for Thumb instructions and encode the equivalent Thumb instruction without complaining. This may have started with Thumb rather than Thumb-2; I have always separated the two syntaxes in my code until recently (and I still generally use ARM syntax for ARM and Thumb for Thumb).
And then, yes, you have to look at what the specific assembler implementation supports, in your case binutils. And it sounds like you have found the binutils/GNU secret decoder ring.
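As a toy illustration of the table-driven approach, here is a C++ sketch that emits random instances of one Thumb instruction; the ADDS encoding is the Thumb ADD (register) format from the ARM ARM, and everything else is scaffolding invented for the example:

#include <cstdint>
#include <cstdio>
#include <random>

// Thumb-1 "ADDS Rd, Rn, Rm": bits[15:9] = 0b0001100, Rm in [8:6], Rn in [5:3], Rd in [2:0]
uint16_t encode_adds(unsigned rd, unsigned rn, unsigned rm) {
    return 0x1800u | ((rm & 7u) << 6) | ((rn & 7u) << 3) | (rd & 7u);
}

int main() {
    std::mt19937 rng(42);                               // fixed seed for reproducibility
    std::uniform_int_distribution<unsigned> pick(0, 7); // Thumb-1 uses low registers only
    for (int i = 0; i < 4; ++i) {
        unsigned rd = pick(rng), rn = pick(rng), rm = pick(rng);
        std::printf("adds r%u, r%u, r%u  ; 0x%04x\n", rd, rn, rm, encode_adds(rd, rn, rm));
    }
}

Cross-checking the printed mnemonics against what GNU as actually emits is then a quick sanity test of the table.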
