interpreting perf tool results on different NUMA nodes - cpu

I want to explain, using the perf tool, an interesting behavior of my program.
Some background info
My machine has 4 NUMA nodes and my main application runs on it. Using cpusets, I partition the machine to give 3 nodes to the application and 1 to the system. While running a unit test (for logging improvements) on the same machine, I get unexpected behavior which I'm trying to investigate with the perf tool.
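For reference, the kind of cpuset partitioning meant here looks roughly like this (a minimal sketch using the cgroup v1 cpuset interface; the group names and node lists are illustrative, not my exact setup):
# Illustrative sketch only (cgroup v1 cpuset mount at /sys/fs/cgroup/cpuset assumed):
# one cpuset for the application on nodes 0-2, one for the system on node 3.
mkdir -p /sys/fs/cgroup/cpuset/app /sys/fs/cgroup/cpuset/sys
echo 0-2 > /sys/fs/cgroup/cpuset/app/cpuset.mems
paste -sd, /sys/devices/system/node/node[0-2]/cpulist > /sys/fs/cgroup/cpuset/app/cpuset.cpus
echo 3 > /sys/fs/cgroup/cpuset/sys/cpuset.mems
cat /sys/devices/system/node/node3/cpulist > /sys/fs/cgroup/cpuset/sys/cpuset.cpus
echo $$ > /sys/fs/cgroup/cpuset/app/tasks   # move the current shell (and its children) into the app set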
The unexpected behavior is that on the NUMA node where the application is running (let's say node2) I get better times from my unit test, roughly half of what I get when the unit test runs on the NUMA node holding the system (let's say node3). In other words, running on a CPU of node2 (a CPU which is not spinning on the lock) gives better results than running on another NUMA node.
I'm trying to improve my logging system, so the test does work which is also done by the application: writing log messages to a queue that is later dumped to disk (by a different thread). Contention on the queue is controlled by a spinlock (CAS). The unit test is one writer thread with 2 nested loops: 100 times, 1000 logs are written to the queue; RDTSC (my choice) is used to measure each inner loop, and then statistics are printed. The queue is big enough, I get a good standard deviation, and memory operations are reduced to a minimum (no memcpy). The reader dumps to disk on another thread.
I tried stopping my application and running the test again. In this case, regardless of the NUMA node selected to run the test, I get results similar to running on node3 (slow), meaning that the running application speeds up my test when the test runs on the same NUMA node as the application. Very unintuitive for me.
I'm using the following perf command to grab some data
rm /tmp/a; rm /tmp/ats; for f in $(perf list | tr -s ' ' | cut -d' ' -f 2 ); do perf stat -o /tmp/ats --append -e $f taskset -c $(cat /sys/devices/system/node/node2/cpulist) ./bin/testCore >> /tmp/a 2>&1;perf stat -o /tmp/ats --append -e $f taskset -c $(cat /sys/devices/system/node/node3/cpulist) ./bin/testCore >> /tmp/a 2>&1;done
and the following command to analyze it
cat /tmp/ats | grep -v "#" | grep -v "^ *$" | grep -v "seconds time elapsed\|Performance\|supported" | grep -v " 0 " | tr -s ' ' | tr -d ',' | awk '{if(knownEvent[$2]){if(((knownEvent[$2]/$1)<0.95) || ((knownEvent[$2]/$1)>1.05)){printf "diff:%s %.2f(%d/%d)\n",$2,(knownEvent[$2]/$1),knownEvent[$2],$1}}else{knownEvent[$2]=$1}}'| more
diff:branch-misses 1.08(470184/435284)
diff:bus-cycles 0.70(25570333/36774093)
diff:cpu-clock 0.69(191/276)
diff:L1-dcache-prefetch-misses 1.05(624172/593780)
diff:LLC-loads 0.92(665199/720197)
diff:LLC-load-misses 1.16(217684/187823)
diff:LLC-prefetches 0.87(193398/222148)
diff:LLC-prefetch-misses 0.85(98653/115483)
diff:dTLB-load-misses 1.09(195668/180291)
diff:dTLB-store-misses 1.11(39755/35910)
diff:scsi:scsi_dispatch_cmd_start 1.38(11/8)
diff:block:block_rq_insert 0.58(36/62)
diff:block:block_rq_issue 1.50(12/8)
diff:block:block_bio_backmerge 1.05(683/650)
diff:block:block_getrq 0.57(36/63)
diff:block:block_plug 0.52(26/50)
diff:kmem:mm_pagevec_free 1.06(16853/15838)
diff:kmem:mm_kernel_pagefault 0.88(7/8)
diff:timer:timer_start 0.36(21/59)
diff:timer:timer_expire_entry 3.00(3/1)
diff:timer:timer_expire_exit 2.00(2/1)
diff:timer:timer_cancel 1.60(8/5)
diff:timer:hrtimer_start 0.93(1275/1368)
diff:timer:hrtimer_expire_entry 0.72(186/259)
diff:timer:hrtimer_expire_exit 0.69(182/264)
diff:timer:hrtimer_cancel 0.74(192/259)
diff:irq:softirq_entry 0.64(374/585)
diff:irq:softirq_exit 0.74(394/536)
diff:irq:softirq_raise 0.73(390/531)
diff:sched:sched_wakeup 0.75(3/4)
diff:sched:sched_stat_wait 0.50(1/2)
diff:sched:sched_stat_sleep 2.00(2/1)
diff:sched:sched_stat_runtime 0.93(1284/1379)
I extracted the events that show some difference.
I'm not sure whether I should suspect the sched-, LLC-, branch-, or TLB-related events, because I have no clue whether the relative difference explains the behavior I see.
Any suggestion of a better way to investigate this?

Related

grep - how to output progress bar or status

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial because grep outputs the search results to STDOUT, and my default workflow is that I output the results to a file and would like the progress bar/status to be output to STDOUT or STDERR.
Would this require modifying source code of grep?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively scan a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports, but they would also increase overhead, since each batch requires an additional grep process start-up, and the start-up time can exceed the time needed to grep a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gives you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the file size) would make the progress report more exact but add additional cost before the scan starts.
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):
# Requires bash 4 and GNU grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
  echo "$i/$total" >&2
  grep -d skip -e "$pattern" "${files[@]:i:100}" >> results.txt
done
For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
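For instance (just a sketch of the idea, not a polished progress display), a carriage return keeps the report on a single line that overwrites itself:
# Inside the loop: rewrite one status line on stderr instead of scrolling
printf '\r%d/%d files' "$i" "$total" >&2
# After the loop: terminate the status line
printf '\n' >&2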
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -0 tells parallel to expect the null-separated names produced by -print0, -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
Try the parallel program
find * -name \*.[ch] | parallel -j5 --bar '(grep grep-string {})' > output-file
Though I found this to be slower than a simple
find * -name \*.[ch] | xargs grep grep-string > output-file
This command shows the progress (speed and offset), but not the total amount. That could be estimated manually, however.
dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"
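If you also want a percentage and ETA, pv can be given the total size up front. A small variant of the command above, without the offset skip (assumes GNU stat for getting the size):
# Feed pv the file size so it can show percentage and ETA, not just throughput
dd if=/input/file bs=1M | pv -s "$(stat -c %s /input/file)" | grep -aob "<string>"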
I'm pretty sure you would need to alter the grep source code. And those changes would be huge.
Currently grep does not know how many lines a file has until it has finished parsing the whole file. For your requirement it would need to parse the file twice, or at least determine the full line count some other way.
The first time it would determine the line count for the progress bar. The second time it would actually do the work and search for your pattern.
This would not only increase the runtime but also violate one of the main tenets of the UNIX philosophy.
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". (source)
There might be other tools out there for your need, but afaik grep won't fit here.
I normally use something like this:
grep | tee "FILE.txt" | cat -n | sed 's/^/match: /;s/$/ /' | tr '\n' '\r' 1>&2
It is not perfect, as it only displays the matches, and if they are too long or differ too much in length there are display errors, but it should give you the general idea.
Or simple dots:
grep | tee "FILE.txt" | sed 's/.*//' | tr '\n' '.' 1>&2

Monitor disk space in an `until` loop

my first question here. Hope it's a good one.
So I'm hoping to create a script that kills another script running arecord when my disk gets to a certain usage. (I should point out, I'm not exactly sure how I got to that df filter... just kinda searched around...) My plan is to run both scripts (the one recording, and the one monitoring disk usage) in separate screens.
I'm doing this all on a Raspberry Pi, btw.
So this is my code so far:
#!/bin/bash
DISK=$(df / | grep / | awk '{ print $5}' | sed 's/%//g')
until [ $DISK -ge 50 ]
do
sleep 1
done
killall arecord
This code works when I play with the starting value ("50" changed to "30" or so). But it doesn't seem to "monitor" my disk the way I want it to. I have a bit of an idea what's going on: the variable DISK is only assigned once, not checked or redefined periodically.
In other words, I probably want something in my until loop that "gets" the disk usage from df, right? What are some good ways of going about it?
PS I'd be super interested in hearing how I might incorporate this whole script's purpose into the script running arecord itself, but that's beyond me right now... and another question...
You are only setting DISK once since it's done before the loop starts and not done as part of the looping process.
A simple fix is to incorporate the evaluation of the disk space into the actual while loop itself, something like:
#!/bin/bash
until [ $(df / | awk 'NR==2 {print $5}' | tr -d '%') -ge 50 ] ; do
sleep 1
done
killall arecord
You'll notice I've made some minor mods to the command as well, specifically:
You can use awk itself to get the relevant line from the df output, no need to use grep in a separate pipeline stage.
I prefer tr for deleting single characters, sed can do it but it's a bit of overkill.

I need to create a script to log the output of the top command

I have to generate a history of the server's memory and CPU usage, but I can't install any software on the server.
So I am trying this script:
#!/bin/bash
while true; do
date >> system.log
top -n1 | grep 'Cpu\|Mem\|java\|eservices' >> system.log
echo '' >> system.log
sleep 2
done
but when I try to execute tail -500f system.log, the log output stops.
You should probably use the -b batch mode parameter. From man top:
Starts top in 'Batch' mode, which could be useful for sending output from top to other programs or to a file. In this mode, top will not accept input and runs until the iterations limit you've set with the '-n' command-line option or until killed.
You might want to use the portable format tail -n 500 -f.
In any case, saving top output to file and then running tail -f on it emulates the way top works. What are you trying to achieve that top does not do already?
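As a sketch, the loop from the question with batch mode added would look like this (only -b is new, everything else is unchanged):
#!/bin/bash
while true; do
    date >> system.log
    # -b: batch mode, so top's output is plain text suitable for a file
    top -b -n1 | grep 'Cpu\|Mem\|java\|eservices' >> system.log
    echo '' >> system.log
    sleep 2
done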
To monitor total free memory use on a server, you can
grep -F MemFree: /proc/meminfo
To monitor process memory use:
ps -o rss $pid
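If a running history is needed rather than a one-off reading, both can be sampled in a loop; a minimal sketch where the process name, interval and log file are placeholders:
pid=$(pgrep -o myprocess)   # hypothetical process name
while true; do
    { date; grep -F MemFree: /proc/meminfo; ps -o rss= -p "$pid"; } >> memory.log
    sleep 2
done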

How to generate a memory shortage using bash script

I need to write a bash script that consumes as much RAM as possible on my ESXi host and potentially generates a memory shortage.
I already checked here and tried to run the given script several times so that I could consume more than 500 MB of RAM.
However, I get an "sh: out of memory" error (of course), and I'd like to know if there is any way to configure the amount of memory allocated to my shell?
Note 1: Another requirement is that I cannot enter a VM and run a greedy task.
Note 2: I tried to script the creation of greedy new VMs with huge RAM, but I cannot get the ESXi host into a state where there is a memory shortage.
Note 3: I cannot use a C compiler and I only have a very limited Python library.
Thank you in advance for your help :)
From an earlier answer of mine: https://unix.stackexchange.com/a/254976/30731
If you have basic GNU tools (head and tail) or BusyBox on Linux, you can do this to fill a certain amount of free memory:
</dev/zero head -c BYTES | tail
# Protip: use $((1024**3*7)) to calculate 7GiB easily
</dev/zero head -c $((1024**3*7)) | tail
This works because tail needs to keep the current line in memory, in case it turns out to be the last line. The line, read from /dev/zero which outputs only null bytes and no newlines, will be infinitely long, but is limited by head to BYTES bytes, thus tail will use only that much memory. For a more precise amount, you will need to check how much RAM head and tail themselves use on your system and subtract that.
To just quickly run out of RAM completely, you can remove the limiting head part:
tail /dev/zero
If you want to also add a duration, this can be done quite easily in bash (will not work in sh):
cat <( </dev/zero head -c BYTES) <(sleep SECONDS) | tail
The <(command) thing seems to be little known but is often extremely useful, more info on it here: http://tldp.org/LDP/abs/html/process-sub.html
As for the use of cat: cat waits for all of its inputs to complete before exiting, and by keeping one of the pipes open it keeps tail alive.
If you have pv and want to slowly increase RAM use:
</dev/zero head -c BYTES | pv -L BYTES_PER_SEC | tail
For example:
</dev/zero head -c $((1024**3)) | pv -L $((1024**2)) | tail
This will use up to a gigabyte at a rate of a megabyte per second. As an added bonus, pv will show you the current rate of use and the total use so far. Of course this can also be done with the previous variants:
</dev/zero head -c BYTES | pv | tail
Just inserting the | pv | part will show you the current status (throughput and total by default).
Credits to falstaff for contributing a variant that is even simpler and more broadly compatible (like with BusyBox).
Bash has the ulimit command, which can be used to set the size of the virtual memory which may be used by a bash process and the processes started by it.
There are two limits, a hard limit and a soft limit. You can lower both limits, but only raise the soft limit up to the hard limit.
ulimit -S -v unlimited sets the virtual memory size to unlimited (if the hard limit allows this).
If there is a hard limit set (see ulimit -H -v), check any initialisation scripts for lines setting it.
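A quick sketch of checking and raising the limit before retrying the memory eater from the answer above (the 1 GiB figure is just an example):
# Values are in KiB, or "unlimited"
ulimit -H -v    # hard limit
ulimit -S -v    # soft limit

# Raise the soft limit as far as the hard limit allows, then retry
ulimit -S -v unlimited
</dev/zero head -c $((1024**3)) | tail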

Counting context switches per thread

Is there a way to see how many context switches each thread generates (both in and out, if possible)? Either as a rate (X/s), or letting it run and giving aggregated data after some time.
(Either on Linux or on Windows.)
I have only found tools that give an aggregated context-switch count for the whole OS or per process.
My program makes many context switches (50k/s), probably many of them unnecessary, but I am not sure where to start optimizing, i.e. where most of them happen.
On recent GNU/Linux systems you can use SystemTap to collect the data you want on every call to sched_switch(). The schedtimes.stp example is probably a good start: http://sourceware.org/systemtap/examples/keyword-index.html#SCHEDULER
Linux
I wrote a small script to see the details of a specific thread of a process. By executing this script you can see the context switches as well.
if [ "$#" -ne 2 ]; then
echo "INVALID ARGUMENT ERROR: Please use ./see_thread.sh processName threadNumber"
exit
fi
ls /proc/`pgrep $1`/task | head -n$2 | tail -n+$2>temp
cat /proc/`pgrep $1`/task/`cat temp`/sched
Hope this will help.
I have a bash script that calculates voluntary and involuntary context switches made by a thread during a specific time frame. I'm not sure whether this will serve your purpose, but I'll post it anyway.
This script loops over all threads of a process and records "voluntary_ctxt_switches" & "nonvoluntary_ctxt_switches" from /proc/<process-id>/task/<thread-id>/status. What I generally do is record these counters at the start of a performance run, record them again at the end, and then calculate the difference as the total voluntary & involuntary context switches during the run.
pid=$(ps -ef | grep "<process name>" | grep "$USER" | grep -v grep | awk '{print $2}')
echo "ThreadId;Vol_Ctx_Switch;Invol_Ctx_Switch"
for tid in $(ps -L --pid "$pid" | awk '{print $2}')
do
    if [ -f /proc/$pid/task/$tid/status ]
    then
        # $NF is the last field, i.e. the counter value on each status line
        vol=$(grep voluntary_ctxt_switches /proc/$pid/task/$tid/status | grep -v nonvoluntary_ctxt_switches | awk '{print $NF}')
        non_vol=$(grep nonvoluntary_ctxt_switches /proc/$pid/task/$tid/status | awk '{print $NF}')
    fi
    echo "$tid;$vol;$non_vol"
done
The script is a bit heavy; in my case the process has around 2500 threads, and the total time to collect the context switches is around 10 seconds.
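A minimal sketch of the start/end workflow described above, assuming the script is saved as ctx_switches.sh (the file name and the sleep standing in for the performance run are placeholders):
./ctx_switches.sh > before.csv
sleep 60                                   # stand-in for the actual performance run
./ctx_switches.sh > after.csv
# Per-thread deltas (vol;non-vol) for threads present in both samples
awk -F';' 'NR==FNR { vol[$1]=$2; non[$1]=$3; next }
           ($1 in vol) && $1 != "ThreadId" { print $1";"$2-vol[$1]";"$3-non[$1] }' \
    before.csv after.csv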
