I strace'd a Java process that was causing a lot of kernel time, to see which syscalls it was making, and was surprised to see that gettimeofday() and clock_gettime() dominated (I suspect logging is responsible). This is strange, considering that man vdso states:
When tracing system calls with strace(1), symbols (system calls) that are exported by the vDSO will not appear in the trace output.
How come these system calls are happening? Is there a way to avoid them?
The machine is running Ubuntu 16.04.1 on EC2.
To make things easier, I created a minimal test program in C (testgtod.c):
#include <stdlib.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    for (int i = 0; i < 1000; i++) {
        /* glibc-wrapped; shouldn't actually issue a syscall */
        gettimeofday(&tv, NULL);
    }
    return 0;
}
I then compiled and ran the program under strace: gcc testgtod.c -o testgtod && sudo strace ./testgtod
Contrary to my expectation, the output included a thousand calls to gettimeofday().
Things I tested to make sure I'm not seeing things:
Made sure the binary is a 64-bit ELF using file
ldd ./testgtod to make sure vDSO is active:
linux-vdso.so.1 => (0x00007ffcee25d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6f6e161000)
/lib64/ld-linux-x86-64.so.2 (0x0000559ed71f3000)
getauxval(AT_SYSINFO_EHDR) != NULL (see the sketch just after this list)
Replaced the gettimeofday(&tv, NULL) calls with syscall(SYS_gettimeofday, &tv, NULL), increased the number of calls to 10 million, and ran under time; runtime behavior was the same in both cases: ./testgtod 0.16s user 0.83s system 99% cpu 0.998 total.
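For reference, the getauxval() check above looks roughly like this (getauxval() lives in <sys/auxv.h> as of glibc 2.16):

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
    /* The kernel maps the vDSO into every process and advertises its
       ELF header address in the auxiliary vector; 0 means no vDSO. */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
    printf("vDSO ELF header at %#lx\n", vdso);
    return vdso == 0;
}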
The issue turned out to be that this is a VM running on Xen; specifically, the Xen clocksource does not yet allow vDSO access to the clock:
ubuntu@machine:~% cat /sys/devices/system/clocksource/*/current_clocksource
xen
Then, I changed the clocksource to tsc:
ubuntu@machine:~% sudo sh -c "echo tsc >/sys/devices/system/clocksource/clocksource0/current_clocksource"
NOTE: switching to the tsc clocksource isn't recommended on production machines, since it may cause the clock to drift backwards.
See https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/ for a detailed write-up on the interaction between vDSO and the clocksource.
NOTE 2: it seems tsc support in Xen has improved with version 4.0, and with better CPU support on Sandy Bridge and later platforms. Modern EC2 machines should be okay with tsc. Check the Xen version using dmesg | grep "Xen version". Amazon already recommended the tsc clocksource at re:Invent 2015 (https://www.slideshare.net/AmazonWebServices/cmp402-amazon-ec2-instances-deep-dive). I'm not yet running this in production, but the situation doesn't seem as bad as packagecloud implies.
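For completeness, a rough way to observe the difference is to time a burst of calls before and after changing the clocksource. A quick sketch, not a rigorous benchmark:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
    enum { CALLS = 10 * 1000 * 1000 };
    struct timespec start, end;
    struct timeval tv;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < CALLS; i++)
        gettimeofday(&tv, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* With the vDSO in play, expect on the order of tens of ns per
       call; with a real syscall per call, expect far more. */
    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.1f ns per gettimeofday()\n", secs * 1e9 / CALLS);
    return 0;
}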
Additional reading:
Why rdtsc interacts poorly with VMs
Xen's 4.0 rdtsc changes
Linux kernel timekeeping documentation, discussing the pitfalls of the TSC
Context
I'm working through some examples in a book by Jonathan Bartlett titled "Learn to Program with Assembly" (2021). The author assumes a Linux environment. I'm on OSX (Monterey). He's using gcc; I've got clang (v 13.1.6). In chapter 7 the author introduces laying out data records.
To facilitate this, he uses the .equ directive to define some constants in a file titled persondata.s which happens to only contain a data segment. For example:
# Describe the components of the struct
.globl WEIGHT_OFFSET, HAIR_OFFSET, HEIGHT_OFFSET, AGE_OFFSET
.equ WEIGHT_OFFSET, 0
.equ HAIR_OFFSET, 8
.equ HEIGHT_OFFSET, 16
.equ AGE_OFFSET, 24
In another file, tallest.s, he makes use of the HEIGHT_OFFSET constant to access the height of a person record. This file has only a text segment.
movq HEIGHT_OFFSET(%rbx), %rax
The Problem
When I assemble tallest.s using the built-in tools on OSX, the assembler complains that I'm trying to use 32-bit absolute addressing in 64-bit mode.
The Question
How is this supposed to work on OSX? How am I supposed to make use of .equ defined constants?
Things I Tried
If I merge these two files into one file, then the assembler doesn't complain. It treats HEIGHT_OFFSET as the constant that it is.
I presume the idea is to have constants defined along with the data, and then make use of those constants in code to avoid 'magic numbers'. Sounds like a good idea.
I tried assembling, linking, and running this code using the book's docker image (johnnyb61820/linux-assembly). It works, no complaints. Some details:
# as -v
GNU assembler version 2.31.1 (x86_64-linux-gnu) using BFD version (GNU Binutils for Debian) 2.31.1
^C
# ld -v
GNU ld (GNU Binutils for Debian) 2.31.1
# uname -a
Linux eded2adb9c06 5.10.124-linuxkit #1 SMP Thu Jun 30 08:19:10 UTC 2022 x86_64 GNU/Linux
So it works as written under that set-up. Just not under my set-up which is clang (v 13.1.6).
Based on the fact that this works in the linuxkit docker image, I thought to install gcc via homebrew on my machine. This got me version 12.2.0 of gcc, which I used to try and compile/link my files. It also thinks HEIGHT_OFFSET is a problem due to 32-bit absolute addressing in 64-bit mode.
Based on the output of uname -a in the docker image, I'm guessing it is 64-bit: Linux eded2adb9c06 5.10.124-linuxkit #1 SMP Thu Jun 30 08:19:10 UTC 2022 x86_64 GNU/Linux
Oddly enough, that setup doesn't complain about 32-bit absolute addressing not being supported, whereas under OSX I had to make everything RIP-relative to access any static data (true for both gcc and clang). Makes me wonder what it is doing with these addresses.
As a possibly final note, under OSX yasm also doesn't like me using .equ defined constants from another file. It complains about wanting to make use of "32 bit absolute relocations" in 64 bit mode. GCC (12.2.0) and llvm-mc (13.0.1) also take issue with the HEIGHT_OFFSET constant.
I just checked my Linux box's config file /boot/config-$(uname -r) and I found that both of these flags are defined:
CONFIG_X86_64=y
CONFIG_X86=y
Shouldn't these two flags be mutually exclusive?
In addition, I am wondering whether these two flags are meant to be used only in the kernel, because I see many instances of
#ifdef CONFIG_X86_64
in the kernel source code. Can a user-space application use this flag as well?
Also, since the processor can switch from 64-bit mode to compatibility mode: if this change happens, won't all code that depends on CONFIG_X86_64 fail at run time? How does an application (kernel or user space) detect whether the machine is in 64-bit or compatibility mode?
Thanks.
CONFIG_X86 is the flag targeting the architecture, i.e. the whole x86 family.
This includes both the 32-bit and the 64-bit processors.
This can be seen by looking at the Kconfig file [1] of the latest kernel (4.15.1 at the time of writing):
# SPDX-License-Identifier: GPL-2.0
# Select 32 or 64 bit
config 64BIT
    bool "64-bit kernel" if ARCH = "x86"
    default ARCH != "i386"
    ---help---
      Say yes to build a 64-bit kernel - formerly known as x86_64
      Say no to build a 32-bit kernel - formerly known as i386

config X86_32
    def_bool y
    depends on !64BIT
    # ... other options removed

config X86_64
    def_bool y
    depends on 64BIT
In this file, config options are stripped of the CONFIG_ prefix.
CONFIG_X86_64 is defined if and only if CONFIG_64BIT is defined; otherwise CONFIG_X86_32 is.
Look at the depends on declarations to see this.
In a 64-bit kernel this command cat /boot/config-$(uname -r) | grep 'CONFIG_64BIT' should return CONFIG_64BIT=y.
This is also confirmed in this answer for a question on how to make a 32-bit config into a 64-bit one.
The antonym of CONFIG_X86_64 is thus CONFIG_X86_32.
TL;DR: CONFIG_X86 is defined for all x86 processors, of either bitness. CONFIG_X86_64 is defined only for the subset of x86 processors supporting AMD64/IA-32e.
[1] This link may change at any time. See this.
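As a side note on the user-space part of the question above: the CONFIG_* macros come from the kernel build system (via a generated autoconf header), so an ordinary application will not see them. User-space code conventionally tests the compiler's predefined target macros instead. A minimal sketch:

#include <stdio.h>

int main(void)
{
    /* __x86_64__ and __i386__ are predefined by gcc/clang according to
       the target being compiled for; they describe the binary, not the
       mode the CPU happens to be in at any given moment. */
#if defined(__x86_64__)
    printf("compiled for 64-bit x86\n");
#elif defined(__i386__)
    printf("compiled for 32-bit x86\n");
#else
    printf("compiled for a non-x86 target\n");
#endif
    return 0;
}

Note that a 32-bit binary built this way runs unchanged on a 64-bit kernel; the CPU executes it in compatibility mode, which is why testing the compile-time target is usually what you want.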
In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. For the purposes of either providing a fallback if dynamic parallelism is not supported, or simply allowing the application to continue with reduced/partially available features, how should I go about compiling?
At the moment I'm getting invalid device function when running kernels compiled with -arch=sm_35 on a GTX 670 (max sm_30), even for kernels that don't require compute 3.5.
AFAIK you can't pass multiple -arch=sm_* arguments, and using multiple -gencode=* options doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when using compute 3.0 (nvlink fatal : no candidate found in fatbinary, due to the -lcudadevrt that I've needed for 3.5). How should I deal with this?
I believe this issue has been addressed now in CUDA 6.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
    printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
    kernel1<<<1,1>>>();
#else
    printf("Hello from non-DP Kernel\n");
#endif
}

int main(){
    kernel2<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
I don't believe there is a way to do this using the runtime API as of CUDA 5.5.
The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The APIs can be safely mixed, so only the context establishment / device selection / module load phase needs to be done with the driver API. You can use the runtime API after that; you will need a little bit of homemade syntactic sugar for the kernel launches, but otherwise no code changes are required in other runtime API code.
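To illustrate, here is a rough sketch of that approach. The cubin file names and the kernel name are made up for the example, the kernel is assumed to be declared extern "C", and error checking is omitted:

#include <cuda.h>   // CUDA driver API
#include <cstdio>

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kern;
    int major = 0, minor = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    cuCtxCreate(&ctx, 0, dev);

    // Pick a cubin built for the device actually present
    // (these file names are hypothetical).
    const char *cubin = (major > 3 || (major == 3 && minor >= 5))
                      ? "kernels_sm35.cubin" : "kernels_sm30.cubin";
    cuModuleLoad(&mod, cubin);
    cuModuleGetFunction(&kern, mod, "kernel2");  // assumes extern "C" name

    // 1x1x1 grid and block, no shared memory, default stream, no arguments.
    cuLaunchKernel(kern, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();
    return 0;
}

Link against the driver library (-lcuda); the rest of the application can keep using the runtime API once the module is loaded.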
Recently I've been trying to modify GCC and gcov to collect the execution sequence of a program. As is well known, gcc instruments the arcs between basic blocks to count how many times each arc executes. So I instrumented a function call on each arc, and the function prints out the number of that arc, letting me collect the program's execution sequence. It works well for C programs on x86 and x86_64, and also for C++ programs on x86. But for C++ programs on x86_64, the program crashes with a segmentation fault. The compilation itself has no problems. The OS I use is CentOS 6.4, and the gcc version is 3.4.5. Does anybody have some advice?
sample program:
#include <iostream>
using namespace std;
int main() {
    cout << "hello world" << endl;
}
If I compile the program in x86_64 mode, the program crashes with a segmentation fault when it comes to the CALL for cout.
OK, after another night of debugging I found that emit_library_call will only generate asm code to invoke my function, but will not protect the context (registers). So function calls before or after the emitted code may fail due to the clobbered context. And x86_64 asm uses different registers than x86, so working on x86 may have been just an accident. I need a function API which can emit a library call and also protect the context. Maybe I should write another emit_library_call.
Perhaps you might try a dynamic binary translation framework, e.g. DynamoRIO or Pin. These tools offer more flexibility than you need, but they would allow you to inject code at the beginning/end of each basic block. What you then want to do is save/restore the flags and registers (and potentially re-align the stack), and call out to a function. DynamoRIO has similar functionality built in, named a "clean call". I think Pin also enables this with a potentially higher-level interface.
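For what it's worth, a minimal Pin tool along the lines of its classic basic-block-counting example might look like the sketch below (untested here; see the examples bundled with Pin for the canonical version):

#include "pin.H"
#include <stdio.h>

// Analysis routine, called before every executed basic block. Pin inserts
// it as a "clean call": registers and flags are saved/restored around it.
VOID OnBasicBlock(ADDRINT addr)
{
    fprintf(stderr, "bbl %p\n", (void *)addr);
}

// Instrumentation routine: runs once per trace, planting a call at the
// head of each basic block in the trace.
VOID InstrumentTrace(TRACE trace, VOID *v)
{
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)OnBasicBlock,
                       IARG_ADDRINT, BBL_Address(bbl), IARG_END);
}

int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(InstrumentTrace, 0);
    PIN_StartProgram();  // never returns
    return 0;
}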
I did the same thing you did, on 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux:
#include <iostream>
using namespace std;
int main()
{
    cout << "hello world" << endl;
}
I compiled the above code with g++ -ftest-coverage -fprofile-arcs hello.cpp -o hello.
A hello.gcno file was generated.
After executing ./hello, a hello.gcda file was generated.
So check your gcc version.
My gcc version is gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5).
Using MacPorts, I have just installed arm-elf-gcc onto my MacBook Pro. This worked flawlessly and everything seems to run fine.
However, after compiling a simple hello-world test program in C and C++ and trying to run either on the target board (an ARM9-based board running Debian Linux), they immediately segfault.
I'm a bit stuck as to how to go about debugging this, as the target board has limited tools available and no gdb. I have successfully built and run other code using a Linux-hosted cross compiler, so it should work.
Any ideas?
Following the suggestion, I have built and run gdbserver, and I get the following in gdb on the host:
Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
I thought it might be a problem with the standard C libraries, so I removed all calls and left just an empty main that returns 0, compiled with -Wall -g hello-arm.cpp -static. As a test I compiled the same source with a Linux-hosted cross compiler, and it runs and exits fine. The only differences I can see are that the Linux-compiled version is over twice the size, and the output from the file command:
arm-elf-gcc: ELF 32-bit LSB executable, ARM, version 1, statically linked, not stripped
arm-*-linux: ELF 32-bit LSB executable, ARM, version 1, statically linked, for GNU/Linux 2.4.18, not stripped
The usual method of debugging in this situation is to run gdbserver on the target board, and connect to it (via ethernet) with gdb running on a host computer.
Alternately, you could try comparing the assembly in a Mac-compiled "Hello World" program and a (working) Linux-compiled one to see what's different.
After digging around for a couple of days I am starting to understand a bit more about embedded compilers. I wasn't really sure of the difference between arm-elf-gcc installed via MacPorts and the arm-unknown-linux toolchain I had installed on my Linux box. I just came across a pdf titled "An introduction to the GNU compiler" which contains the following paragraph:
Important: Using the GNU Compiler to create your executable is not quite the same as using the GNU Linker, arm-elf-ld, yourself. The reason is that the GNU Compiler automatically links a number of standard system libraries into your executable. These libraries allow your program to interact with an operating system, to use the standard C library functions, to use certain language features and operations (such as division), and so on. If you wish to see exactly which libraries are being linked into the executable, you should pass the verbose flag -v to the compiler.

This has important implications for embedded systems! Such systems do not usually have an operating system. This means that linking in the system libraries is almost always meaningless: if there is no operating system, for example, then calling the standard printf function does not make much sense.
So when I get back to my dev machine later I will determine the libraries linked in with the Linux build and add them to the arm-elf-gcc build.
I'll update this when I have more information, but I just want to document my findings in case anyone else has these problems.