Number of forks a shell script performs during execution

Is there a way to compute the number of forks a shell script performs while it is executing? I've been looking at maybe writing a C wrapper using getrusage(2) and analyzing the various fields of
struct rusage {
    struct timeval ru_utime; /* user time used */
    struct timeval ru_stime; /* system time used */
    long ru_maxrss;          /* max resident set size */
    long ru_ixrss;           /* integral shared text memory size */
    long ru_idrss;           /* integral unshared data size */
    long ru_isrss;           /* integral unshared stack size */
    long ru_minflt;          /* page reclaims */
    long ru_majflt;          /* page faults */
    long ru_nswap;           /* swaps */
    long ru_inblock;         /* block input operations */
    long ru_oublock;         /* block output operations */
    long ru_msgsnd;          /* messages sent */
    long ru_msgrcv;          /* messages received */
    long ru_nsignals;        /* signals received */
    long ru_nvcsw;           /* voluntary context switches */
    long ru_nivcsw;          /* involuntary context switches */
};
but the number of forks isn't available here. My next idea is to strace the shell and its children and look for the forks. Is there a simpler way with less overhead? Is there some shell with a nonstandard option/variable/mechanism to report the number of forks?

There are a few options:
- The best multi-platform approach is likely strace or its equivalents (truss, ktrace), or dtrace; see below. This also lets you attach to a running process.
- A workable, if slightly tricky, multi-platform approach is to create a dynamic library with your own versions of fork/execve etc. which log the calls and then invoke the real C library functions; search SO for LD_PRELOAD to get some ideas, and see the sketch after this list. This won't work on statically linked binaries, though.
- On Linux and Solaris you can set the environment variable LD_DEBUG=files, and the dynamic linker will issue various diagnostics as executables and libraries are loaded. On Linux each new process should output some or most of the "initialize", "init" and "fini" entries, along with PIDs. This also won't work on statically linked binaries.
- If you are on Linux, *BSD or Solaris, have root access, and process accounting is available, you can run your command(s) and then inspect the output of lastcomm or dump-acct. This might require accounting to be started (if it is not already running), and on some platforms it might not provide the details you need; on RH/CentOS 6 it can be done easily and provides all the details needed. Other systems also have process accounting.
- If you are on Linux with auditd support, you can use autrace myscript.sh to log system calls (auditd should be running so the kernel data is logged to the audit file).
- For completeness: you could use a debugger, but that's about the most tedious approach I can think of ;-)
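A minimal sketch of that LD_PRELOAD idea, counting fork() calls (the counter, the child-side reset and the at-exit report are illustrative choices of mine, not part of any existing tool):

/* forkcount.c - hedged sketch of an LD_PRELOAD fork counter.
 * Build: gcc -shared -fPIC -o forkcount.so forkcount.c -ldl
 * Run:   LD_PRELOAD=./forkcount.so ./myscript.sh
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static unsigned long fork_count;

pid_t fork(void)
{
    pid_t (*real_fork)(void) = dlsym(RTLD_NEXT, "fork");
    pid_t pid = real_fork();

    if (pid > 0)
        fork_count++;          /* count in the parent */
    else if (pid == 0)
        fork_count = 0;        /* child starts its own tally */
    return pid;
}

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "[forkcount] pid %d forked %lu time(s)\n",
            (int)getpid(), fork_count);
}

Bear in mind that a shell may create processes via vfork() or posix_spawn() rather than fork(), so in practice you may need to interpose those entry points as well.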
On Linux you can trace execution (moderate performance penalty) with:
strace -f -o /tmp/myscript.trace -e trace=process ./myscript.sh
Then inspect the .trace file. The -e trace=process parameter filters the output to show only process-related syscalls, so you can count process creations with, for example, grep -c clone /tmp/myscript.trace (on modern Linux, fork() is implemented via the clone syscall).
On Solaris you can trace with:
truss -f -o /tmp/myscript.trace \
-u libc:fork,execl,execv,execle,execve,execlp,execvp ./myscript.sh
Solaris truss lets you trace both userland libraries and kernel syscalls. You could also use dtrace, see here for some ideas: http://www.brendangregg.com/DTrace/lostcpu.html
Other platforms have variables similar to LD_DEBUG or LD_VERBOSE, see the linker documentation (e.g. man ld.so).
In all the cases above you should understand that what programs (usually) call are C library functions such as fork(); what is actually requested of the kernel depends on the OS, and may end up as a vfork, execve or clone syscall.

Related

Reading performance registers from the kernel

I want to read certain performance counters. I know there are tools like perf that can do this from user space, but I want the code to be inside the Linux kernel.
I want to write a mechanism to monitor the performance counters of an Intel(R) Core(TM) i7-3770 CPU, running Ubuntu with kernel 4.19.2. I got the following method from easyperf.
Here's part of my code to read instructions.
struct perf_event_attr pe;
int fd;

memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 0;
pe.exclude_kernel = 0;
pe.exclude_user = 0;
pe.exclude_hv = 0;
pe.exclude_idle = 0;
/* pid, cpu, grp and flags are set elsewhere in my code */
fd = syscall(__NR_perf_event_open, &pe, pid, cpu, grp, flags);

uint64_t perf_read(int fd) {
    uint64_t val;
    int rc;

    rc = read(fd, &val, sizeof(val));
    assert(rc == sizeof(val));
    return val;
}
I want to put the same lines in the kernel code (in the context switch function) and check the values being read.
My end goal is to figure out a way to read the performance counters for a process, every time it switches to another, from the kernel (4.19.2) itself.
To achieve this I checked out the code behind the system call __NR_perf_event_open. It can be found here.
To make it usable I copied the body of the syscall into a separate function named perf_event_open() in the same file, and exported it.
Now the problem is that whenever I call perf_event_open() in the same way as above, the descriptor returned is -2. Checking the error codes, I figured out that the error was ENOENT. In the perf_event_open() man page, the cause of this error is given as a wrong type field.
Since file descriptors are associated with the process that opened them, how can one use them from the kernel? Is there an alternative way to configure the PMU to start counting without involving file descriptors?
You probably don't want the overhead of reprogramming a counter inside the context-switch function.
The easiest thing would be to make system calls from user-space to program the PMU (to count some event, probably setting it to count in kernel mode but not user-space, just so the counter overflows less often).
Then just use rdpmc twice (to get start/stop counts) in your custom kernel code. The counter will stay running, and I guess the kernel perf code will handle interrupts when it wraps around. (Or when its PEBS buffer is full.)
IDK if it's possible to program a counter so it just wraps without interrupting, for use-cases like this where you don't care about totals or sample-based profiling, and just want to use rdpmc. If so, do that.
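For the reading side, here is a minimal sketch of an rdpmc wrapper (standard x86 inline asm; which counter index to pass in ECX depends on how the perf code programmed the PMU, and finding that out is up to you):

#include <stdint.h>

/* Read programmable PMC number `ctr` (x86). With CR4.PCE clear this
 * faults in user space, but it always works from ring 0 / kernel code. */
static inline uint64_t rdpmc(uint32_t ctr)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ctr));
    return ((uint64_t)hi << 32) | lo;
}

Call it once on entry to and once on exit from the code you care about, and subtract the two values.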
Old answer, addressing your original question, which was based on a buggy printf format string that printed non-zero garbage even though you weren't counting anything in user-space either.
Your inline asm looks correct, so the question is what exactly that PMU counter is programmed to count in kernel mode in the context where your code runs.
perf virtualizes the PMU counters on context switch, giving the illusion of perf stat counting a single process even as it migrates across CPUs. Unless you're using perf stat -a to get system-wide counts, the PMU might not be programmed to count anything at all, so multiple reads would all give 0 even if at other times it's programmed to count a fast-changing event like cycles or instructions.
Are you sure you have perf set to count user + kernel events, not just user-space events?
perf stat will show something like instructions:u instead of instructions if it's limiting itself to user-space. (This is the default for non-root if you haven't lowered sysctl kernel.perf_event_paranoid to 0 or something from the safe default that doesn't let user-space learn anything about the kernel.)
There's HW support for programming a counter to only count when CPL != 0 (i.e. not in ring 0 / kernel mode). Higher values for kernel.perf_event_paranoid restrict the perf API to not allow programming counters to count in kernel+user mode, but even with paranoid = -1 it's possible to program them this way. If that's how you programmed a counter, then that would explain everything.
We need to see your code that programs the counters. That doesn't happen automatically.
The kernel doesn't just leave the counters running all the time when no process has used a PAPI function to enable a per-process or system-wide counter; that would generate interrupts that slow the system down for no benefit.
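To make the "program it from user-space" suggestion concrete, here is a hedged, minimal sketch of opening a counter that counts instructions in both user and kernel mode (error handling trimmed; it assumes kernel.perf_event_paranoid permits counting kernel events):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr pe;
    uint64_t count;
    int fd;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.exclude_kernel = 0;   /* count in kernel mode too */
    pe.exclude_user = 0;
    pe.disabled = 0;         /* start counting immediately */

    /* this process (pid 0), any CPU (-1), no group, no flags */
    fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    read(fd, &count, sizeof(count));
    printf("instructions so far: %llu\n", (unsigned long long)count);
    return 0;
}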

Why is it required to pass the flags parameter to local_irq_save as a stack variable?

Robert Love's "Linux Kernel Development" book says that it is required that we pass the flags to local_irq_save() as a stack variable.
Why is this required? Is it okay to bypass this requirement in x86?
You probably refer to this quote from LKD3:
local_irq_save(flags); /* interrupts are now disabled */
/* ... */
local_irq_restore(flags); /* interrupts are restored to their previous state */
Note that these methods are implemented at least in part as macros, so the flags parameter (which must be defined as an unsigned long) is seemingly passed by value. This parameter contains architecture-specific data containing the state of the interrupt systems. Because at least one supported architecture incorporates stack information into the value (ahem, SPARC), flags cannot be passed to another function (specifically, it must remain on the same stack frame). For this reason, the call to save and the call to restore interrupts must occur in the same function.
I don't see any requirement for the flags variable to be declared on the stack. What the book says is:
- some architectures (e.g. SPARC) add stack-specific information to the flags variable when doing local_irq_save(), along with the interrupt state
- so you shouldn't pass the flags variable to another function
- and you should run local_irq_save() / local_irq_restore() in the same stack frame
You must be confused by this statement:
specifically, it must remain on the same stack frame
I'd change it a little bit:
specifically, it must remain on the same stack frame between save/restore calls
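In other words (a hedged illustration, not real driver code):

#include <linux/irqflags.h>

static void fine(void)
{
    unsigned long flags;

    local_irq_save(flags);      /* save state, disable interrupts */
    /* ... critical section ... */
    local_irq_restore(flags);   /* restored in the same frame: OK */
}

/* Anti-pattern per the book's reasoning: handing flags to another
 * function to restore them, i.e. restoring in a different frame. */
static void restore_elsewhere(unsigned long flags)
{
    local_irq_restore(flags);
}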
More than that, if you do:
$ git grep --all-match -e 'local_irq_save(\*' -- kernel/ include/linux/
you will see that even core kernel code sometimes keeps the flags value in heap-allocated memory, passing it dereferenced through a pointer.
As for the mentioned implementation on the SPARC architecture: the book probably refers to this code. So on SPARC the flags variable will contain the PSR register, which in turn contains the CWP field, which is presumably what ties it to the stack.
Is it okay to bypass this requirement in x86?
When you are writing architecture-independent code (like drivers), you should consider the behavior on all architectures, so when using an architecture-independent API you usually shouldn't rely on the quirks of particular platforms. But again, you can keep flags in heap memory any time you want.

Why do we need a linker script and startup code?

I've read this tutorial
I could follow the guide and run the code, but I have questions.
1) Why do we need both a load address and a run-time address? As I understand it, this is because we put .data in flash too; so why don't we run the app from there, rather than having startup code copy it into RAM?
http://www.bravegnu.org/gnu-eprog/c-startup.html
2) Why do we need a linker script and startup code here? Can I not just build the C source as below and run it with qemu?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks
Your first question was answered in the guide.
When you load a program on an operating system, your .data section (basically the non-zero globals) is loaded from the "binary" into the right offset in memory for you, so that when your program starts, the memory locations that represent your variables hold those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it, yes? Well, if you are booting from flash, bare metal, you don't have an operating system to copy that value into RAM for you; somebody has to do it. Further, all of the .data contents have to be in flash: that number 5 has to be somewhere in flash so that it can be copied to RAM. So you need a flash address for it and a RAM address for it. Two addresses for the same thing.
And that begins to answer your second question. For every line of C code you write, you assume things, for example that any function can call any other function. You would like to be able to call functions, yes? And you would like to be able to have local variables, and you would like the variable x above to be 5, and you might assume that y will be zero, although, thankfully, compilers are starting to warn about that. The startup code, at a minimum for generic C, sets up the stack pointer, which allows you to call other functions, have local variables, and have functions more than one or two lines of code long; it zeros .bss so that the y variable above is zero; and it copies the value 5 over to RAM so that x is ready to go by the time your entry-point C function runs.
If you don't have an operating system then you have to have code to do this, and yes, there are many, many sandboxes and toolchains that are set up for various platforms and already have the startup code and linker script, so that you can just
gcc -o myprog.elf myprog.c
Now, that doesn't mean you can make system calls without a...system... printf, fopen, etc. But if you download one of these toolchains it does mean that you don't actually have to write the linker script nor the bootstrap.
But it is still valuable information. Note that startup code and a linker script are required for operating-system-based programs too; it is just that native compilers for your operating system assume you are going to mostly write programs for that operating system, and as a result they provide a linker script and startup code in that toolchain.
1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
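As a rough sketch of what such startup code does (the symbol names here are illustrative; a real linker script exports its own, and setting the stack pointer itself has to happen in a few lines of assembly before this C code can run):

/* Hedged sketch of a bare-metal C startup routine. */
extern unsigned int _data_load;    /* load address of .data in flash */
extern unsigned int _data_start;   /* run-time address of .data in RAM */
extern unsigned int _data_end;
extern unsigned int _bss_start;
extern unsigned int _bss_end;

extern int main(void);

void startup(void)
{
    unsigned int *src = &_data_load;
    unsigned int *dst;

    /* copy initialized data from flash to its run-time home in RAM */
    for (dst = &_data_start; dst < &_data_end; )
        *dst++ = *src++;

    /* zero .bss so uninitialized globals read as 0 */
    for (dst = &_bss_start; dst < &_bss_end; )
        *dst++ = 0;

    main();
    for (;;)
        ;   /* nothing to return to on bare metal */
}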
1) Why do we need both load-address and run-time address?
While it is in most cases possible to run code from memory-mapped ROM, code often executes faster from RAM. In some cases there may also be much more RAM than ROM, and the application may be stored compressed in ROM, so the executable code is not simply copied from ROM but also decompressed, allowing a much larger application than would fit in the available ROM.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of your target and application. Since this tutorial is about bare-metal programming, there is no OS to handle that for you. Similarly, the startup code is required to at least set an initial stack pointer, initialise static data, and jump to main. On an embedded system it is also often necessary to initialise various hardware such as the PLL, memory controllers, etc.

Make a system call to get list of processes

I'm new to module programming and I need to make a system call to retrieve the system's processes and show how much CPU each is consuming.
How can I make this call?
Why would you implement a system call for this? You don't want to add a syscall to the existing Linux API. This is the primary Linux interface to userspace, and nobody touches syscalls except top kernel developers who know what they're doing.
If you want to get a list of processes and their parameters and real-time statuses, use /proc. Every directory that's an integer in there is an existing process ID and contains a bunch of useful dynamic files which ps, top and others use to print their output.
If you want to get a list of processes within the kernel (e.g. within a module), you should know that the processes are kept internally as a doubly linked list that starts with the init process (symbol init_task in the kernel). You should use macros defined in include/linux/sched.h to get processes. Here's an example:
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/sched.h>

static int __init ex_init(void)
{
    struct task_struct *task;

    rcu_read_lock();        /* the task list is RCU-protected */
    for_each_process(task)
        pr_info("%s [%d]\n", task->comm, task->pid);
    rcu_read_unlock();
    return 0;
}

static void __exit ex_fini(void)
{
}

module_init(ex_init);
module_exit(ex_fini);
MODULE_LICENSE("GPL");
This should be okay to gather information. However, don't change anything in there unless you really know what you're doing (which will require a bit more reading).
There are syscalls for that, called open and read. The information for all processes is kept under the /proc/{pid} directories. You can gather process information by reading the corresponding files.
More explained here: http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
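A hedged user-space sketch of that approach, pulling the CPU-time fields out of /proc/<pid>/stat (the process reads its own entry here; utime and stime are fields 14 and 15 per proc(5), counted in clock ticks):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/stat", "r");
    unsigned long utime, stime;
    char comm[64], state;
    int pid, i;

    if (!f)
        return 1;
    /* fields 1-3: pid, comm, state (note: %63s breaks if comm contains
     * spaces; a robust parser should scan for the closing parenthesis) */
    fscanf(f, "%d %63s %c", &pid, comm, &state);
    for (i = 0; i < 10; i++)
        fscanf(f, "%*s");          /* skip fields 4..13 */
    fscanf(f, "%lu %lu", &utime, &stime);
    fclose(f);

    printf("pid %d: utime=%lu stime=%lu (clock ticks)\n", pid, utime, stime);
    return 0;
}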

How can I override malloc(), calloc(), free() etc. under OS X?

Assuming the latest XCode and GCC, what is the proper way to override the memory allocation functions (I guess operator new/delete as well). The debugging memory allocators are too slow for a game, I just need some basic stats I can do myself with minimal impact.
I know it's easy on Linux due to the hooks, and this was trivial under CodeWarrior ten years ago when I wrote HeapManager.
Sadly SmartHeap no longer has a Mac version.
I would use library preloading for this task, because it does not require modification of the running program. If you're familiar with the usual Unix way to do this, it's almost a matter of replacing LD_PRELOAD with DYLD_INSERT_LIBRARIES.
First step is to create a library with code such as this, then build it using regular shared library linking options (gcc -dynamiclib):
#include <dlfcn.h>   /* dlsym(), RTLD_NEXT (needs _GNU_SOURCE on glibc) */
#include <stdio.h>
#include <stdlib.h>

void *malloc(size_t size)
{
    void *(*real_malloc)(size_t);

    real_malloc = dlsym(RTLD_NEXT, "malloc");
    fprintf(stderr, "allocating %lu bytes\n", (unsigned long)size);
    /* Do your stuff here */
    return real_malloc(size);
}
Note that if you also divert calloc() and its implementation calls malloc(), you may need additional code to check how you're being called. C++ programs should be pretty safe because the new operator calls malloc() anyway, but be aware that no standard enforces that. I have never encountered an implementation that didn't use malloc(), though.
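For that calloc()-calls-malloc() case, one hedged sketch of a re-entrancy guard (the in_hook flag is my own illustrative device; note that on some platforms dlsym() itself may allocate, though OS X's dyld keeps this simple form workable):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static __thread int in_hook;   /* set while one of our hooks is running */

void *calloc(size_t nmemb, size_t size)
{
    void *(*real_calloc)(size_t, size_t) = dlsym(RTLD_NEXT, "calloc");
    void *ptr;

    if (in_hook)                       /* nested call: stay quiet */
        return real_calloc(nmemb, size);

    in_hook = 1;
    fprintf(stderr, "calloc(%zu, %zu)\n", nmemb, size);
    ptr = real_calloc(nmemb, size);
    in_hook = 0;
    return ptr;
}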
Finally, set up the running environment for your program and launch it (might require adjustments depending on how your shell handles environment variables):
export DYLD_INSERT_LIBRARIES=./yourlibrary.dylib
export DYLD_FORCE_FLAT_NAMESPACE=1
yourprogram --yourargs
See the dyld manual page for more information about the dynamic linker environment variables.
This method is pretty generic. There are limitations, however:
- You won't be able to divert direct system calls.
- If the application itself tricks you by using dlsym() to load malloc's address, the call won't be diverted. Unless, however, you trick it back by also diverting dlsym!
The malloc_default_zone technique mentioned at http://lists.apple.com/archives/darwin-dev/2005/Apr/msg00050.html appears to still work, see e.g. http://code.google.com/p/fileview/source/browse/trunk/fileview/fv_zone.cpp?spec=svn354&r=354 for an example use that seems to be similar to what you intend.
After much searching (here included) and issues with 10.7 I decided to write a blog post about this topic: How to set malloc hooks in OSX Lion
You'll find a few good links at the end of the post with more information on this topic.
The basic solution:
/* protect_size and malloc_zones are defined earlier in the blog post */
malloc_zone_t *dz = malloc_default_zone();

if (dz->version >= 8)
{
    /* remove the write protection */
    vm_protect(mach_task_self(), (uintptr_t)malloc_zones, protect_size, 0,
               VM_PROT_READ | VM_PROT_WRITE);
}
original_free = dz->free;
dz->free = &my_free;   /* this assignment throws a bad-ptr exception without the vm_protect above */
if (dz->version == 8)
{
    /* put the write protection back */
    vm_protect(mach_task_self(), (uintptr_t)malloc_zones, protect_size, 0,
               VM_PROT_READ);
}
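For completeness, a hedged sketch of the pieces that snippet assumes (the hook signature matches the zone function pointers in <malloc/malloc.h>):

#include <malloc/malloc.h>

/* saved pointer to the zone's original free implementation */
static void (*original_free)(malloc_zone_t *zone, void *ptr);

static void my_free(malloc_zone_t *zone, void *ptr)
{
    /* gather your stats here, then forward to the real implementation */
    original_free(zone, ptr);
}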
This is an old question, but I came across it while trying to do this myself. I got curious about this topic for a personal project I was working on, mainly to make sure that what I thought was automatically deallocated was being properly deallocated. I ended up writing a C++ implementation to allow me to track the amount of allocated heap and report it out if I so chose.
https://gist.github.com/monitorjbl/3dc6d62cf5514892d5ab22a59ff34861
As the name notes, this is OSX-specific. However, I was able to do the same on Linux using malloc_usable_size.
Example
#define MALLOC_DEBUG_OUTPUT
#include "malloc_override_osx.hpp"
int main(){
    int* ip = (int*)malloc(sizeof(int));
    double* dp = (double*)malloc(sizeof(double));
    free(ip);
    free(dp);
}
Building
$ clang++ -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk \
-pipe -stdlib=libc++ -std=gnu++11 -g -o test test.cpp
$ ./test
0x7fa28a403230 -> malloc(16) -> 16
0x7fa28a403240 -> malloc(16) -> 32
0x7fa28a403230 -> free(16) -> 16
0x7fa28a403240 -> free(16) -> 0
Hope this helps someone else out in the future!
If the basic stats you need can be collected in a simple wrapper, a quick (and kinda dirty) trick is just using some #define macro replacement.
void* _mymalloc(size_t size)
{
    void* ptr = malloc(size);
    /* do your stat work? */
    return ptr;
}
and
#define malloc(sz_) _mymalloc(sz_)
Note: if the macro is defined before the _mymalloc definition, it will end up replacing the malloc call inside that function, leaving you with infinite recursion... so ensure this isn't the case. You might want to explicitly #undef it before that function definition and simply (re)define it afterward, depending on where you end up including it, to hopefully avoid this situation.
I think if you define malloc() and free() in your own .c file included in the project, the linker will resolve that version.
Now then, how do you intend to implement malloc?
Check out the approach Emery Berger, author of the Hoard memory allocator, uses to replace the allocator on OSX: https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/macwrapper.cpp (plus a few other files you can trace yourself by following the includes).
This is complementary to Alex's answer, but I thought this example was more to-the-point of replacing the system provided allocator.
