Analyzing readdir() performance

It's bothering me that Linux takes so long to list all files in huge directories, so I wrote a little test program that recursively counts all files under a directory:
#include <stdio.h>
#include <string.h>
#include <dirent.h>

/* Recursively count all non-directory entries below path. */
int list(char *path) {
    int i = 0;
    DIR *dir = opendir(path);
    struct dirent *entry;
    char new_path[1024];
    if (!dir)
        return 0;
    while ((entry = readdir(dir))) {
        if (entry->d_type == DT_DIR) {
            if (entry->d_name[0] == '.')   /* skips ".", ".." and hidden dirs */
                continue;
            strcpy(new_path, path);
            strcat(new_path, "/");
            strcat(new_path, entry->d_name);
            i += list(new_path);
        }
        else
            i++;
    }
    closedir(dir);
    return i;
}

int main() {
    char *path = "/home";
    printf("%i\n", list(path));
    return 0;
}
When compiled with gcc -O3, the program runs for about 15 seconds (I ran it a few times and the runtime is approximately constant, so the fs cache should not play a role here):
$ /usr/bin/time -f "%CC %DD %EE %FF %II %KK %MM %OO %PP %RR %SS %UU %WW %XX %ZZ %cc %ee %kk %pp %rr %ss %tt %ww %xx" ./a.out
./a.outC 0D 0:14.39E 0F 0I 0K 548M 0O 2%P 178R 0.30S 0.01U 0W 0X 4096Z 7c 14.39e 0k 0p 0r 0s 0t 1692w 0x
So it spends about S=0.3 sec in kernel space and U=0.01 sec in user space, and has 7+1692 context switches.
At roughly 2000 nsec per context switch, that accounts for 2000 nsec * (7+1692) = 3.398 msec [1].
However, that leaves more than 10 seconds unaccounted for, and I would like to find out what the program is doing in that time.
Are there any other tools to investigate what the program is doing all that time?
gprof just tells me the time for the (userspace) call graph, and gcov does not list time spent in each line, only how often a line is executed...
[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

oprofile is a decent sampling profiler which can profile both user and kernel-mode code.
According to your numbers, however, approximately 14.5 seconds of the time is spent asleep, which oprofile does not register well. What may be more useful is ftrace combined with a reading of the kernel code. ftrace provides trace points in the kernel which can log a message and stack trace when they occur. The event that seems most useful for determining why your process is sleeping is the sched_switch event. I would recommend that you enable kernel-mode stacks and the sched_switch event, set a buffer large enough to capture the entire lifetime of your process, then run your process and stop tracing immediately after. By reviewing the trace, you will be able to see every time your process went to sleep, whether it was runnable or non-runnable, a high-resolution timestamp, and a call stack indicating what put it to sleep.
ftrace is controlled through debugfs. On my system, this is mounted in /sys/kernel/debug, but yours may be different. Here is an example of what I would do to capture this information:
# Enable stack traces
echo "1" > /sys/kernel/debug/tracing/options/stacktrace
# Enable the sched_switch event
echo "1" > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# Make sure tracing is enabled
echo "1" > /sys/kernel/debug/tracing/tracing_on
# Run the program and disable tracing as quickly as possible
./your_program; echo "0" > /sys/kernel/debug/tracing/tracing_on
# Examine the trace
vi /sys/kernel/debug/tracing/trace
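One thing the steps above leave out is the buffer sizing mentioned earlier; if the ring buffer wraps, the start of your run is lost. A hedged addition, assuming the same debugfs mount point (the value is per-CPU, in KB):
# Enlarge the ring buffer so the whole run fits
echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb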
The resulting output will have lines which look like this:
# tracer: nop
#
# entries-in-buffer/entries-written: 22248/3703779 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d..3 2113.437500: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:0 next_pid=878 next_prio=120
<idle>-0 [000] d..3 2113.437531: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> rest_init
=> start_kernel
kworker/0:0-878 [000] d..3 2113.437836: sched_switch: prev_comm=kworker/0:0 prev_pid=878 prev_prio=120 prev_state=S ==> next_comm=your_program next_pid=898 next_prio=120
kworker/0:0-878 [000] d..3 2113.437866: <stack trace>
=> __schedule
=> schedule
=> worker_thread
=> kthread
=> ret_from_fork
The lines you will care about will be when your program appears as the prev_comm task, meaning the scheduler is switching away from your program to run something else. prev_state will indicate that your program was still runnable (R) or was blocked (S, U or some other letter, see the ftrace source). If blocked, you can examine the stack trace and the kernel source to figure out why.
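Rather than paging through the whole buffer by hand, it may help to pull out just the events where the scheduler switches away from your process. A hedged one-liner, assuming the binary is actually named your_program (-A 8 keeps the stack-trace lines that follow each event):
grep -A 8 "prev_comm=your_program" /sys/kernel/debug/tracing/trace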


Why is this if statement forcing zenity's --auto-close to close immediately?

I've got a Debian package set up for a little application I was working on. That isn't really relevant to the question, but for some context: the application is a simple bash script that deploys some docker containers on the local machine. I wanted to add a dependency check to make sure the system had docker before it attempted anything: if it doesn't, download it; if it does, carry on. I figured it would be nice to have a little zenity dialog alongside to show what was going on.
As part of that, I check for internet before starting, for obvious reasons. And for some reason, the way I check for internet, when zenity has the --auto-close flag, instantly closes the entire progress block.
Here is a little dummy example; that if statement is a straight copy-paste from my code, everything else is filler:
#!/bin/bash
condition=0
if [[ $condition ]]; then
    (
        echo "0"
        # Check for internet
        if ping -c 3 -W 3 gcr.io; then
            echo "# Internet detected, starting updates..."; sleep 1
            echo "10"
        else
            err_msg="# No internet detected. You may be missing some dependencies.
Services may not function as expected until they are installed."
            echo $err_msg
            zenity --error --text="$err_msg"
            echo "100"
            exit 1
        fi
        echo "15"
        echo "# Downloading a thing" ; sleep 1
        echo "50"
        if zenity --question --text="Do you want to download a special thing?"; then
            echo "# Downloading special thing" ; sleep 1
        else
            echo "# Not downloading special thing" ; sleep 1
        fi
        echo "75"
        echo "# downloading big thing" ; sleep 3
        echo "90"
        echo "# Downloading last thing" ; sleep 1
        echo "100"
    ) |
    zenity --progress --title="Dependency Management" --text="downloading dependencies, please wait..." \
        --percentage=0 --auto-close
fi
So I'm really just wondering why this is making zenity freak out. If you comment out that if statement, everything works as you'd expect and the zenity progress screen closes once it hits 100.
If you keep the if statement but remove the --auto-close flag, it also executes as expected. It's as if it initializes at 100 and then goes back to 0 to progress normally. But if that were the case, --auto-close would never work, yet in the little example in the help section it works just fine: https://help.gnome.org/users/zenity/stable/progress.html.en
Thank you for a fun puzzle! Spoiler is at the end, but I thought it might be helpful to look over my shoulder while I poked at the problem. 😀 If you're more interested in the answer than the journey, feel free to scroll. I'll never know, anyway.
Following my own advice (see 1st comment beneath the question), I set out to create a small, self-contained, complete example. But, as they say in tech support: Before you can debug the problem, you need to debug the customer. (No offense; I'm a terrible witness myself unless I know ahead of time that someone's going to need to reproduce a problem I've found.)
I interpreted your comment about checking for Internet to mean "it worked before I added the ping and failed afterward," so the most sensible course of action seemed to be commenting out that part of the code... and then it worked! So what happens differently when the ping is added?
Changes in timing wouldn't make sense, so the problem must be that ping generates output that gets piped to zenity. So I changed the command to redirect its output to the bit bucket:
ping -c 3 -W 3 gcr.io &>/dev/null;
...and that worked, too! Interesting!
I explored what turned out to be a few ratholes:
I ran ping from the command line and piped its output through od -xa to check for weird control characters, but nope.
Instead of enclosing the contents of the if block in parentheses (()), which executes the commands in a sub-shell, I tried braces ({}) to execute them in the same shell. Nope, again.
I tried a bunch of other embarrassingly useless and time-consuming ideas. Nope, nope, and nope.
Then I realized I could just do
ping -c 3 -W 3 gcr.io | zenity --progress --auto-close
directly from the command line. That failed with the --auto-close flag but worked normally without it. Boy, did that simplify things! That's about as "smallest" as you can get. But it's not, actually: I used up all of my remaining intelligence points for the day by redirecting the output from ping into a file, so I could just
(cat output; sleep 1) | zenity --progress --auto-close
and not keep poking at poor gcr.io until I finally figured this thing out. (The sleep gave me enough time to see the pop-up when it worked, because zenity exits when the pipe closes at the end of the input.) So, what's in that output file?
PING gcr.io (172.253.122.82) 56(84) bytes of data.
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=1 ttl=59 time=18.5 ms
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=2 ttl=59 time=21.8 ms
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=3 ttl=59 time=21.4 ms
--- gcr.io ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 18.537/20.572/21.799/1.449 ms
The magic zenity-killer must be in there somewhere! All that was left (ha, "all"!) was to make my "smallest" example even smaller by deleting pieces of the file until it stopped breaking. Then I'd put back whatever I'd deleted last, and I deleted something else, da capo, ad nauseam, or at least ad minimus. (Or whatever; I don't speak Latin.) Eventually the file dwindled to
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=1 ttl=59 time=18.5 ms
and I started deleting stuff from the beginning. Eventually I found that it would break regardless of the length of the line, as long as it started with a number that wasn't 0 and had at least 3 digits somewhere within it. Huh. It'd also break if it did start with a 0 and had at least 4 digits within... unless the second digit was also 0! What's more, a period would make it even weirder: none of the digits anywhere after the period would make it break, no matter what they were.
And then, then came the ah-ha! moment. The zenity documentation says:
Zenity reads data from standard input line by line. If a line is
prefixed with #, the text is updated with the text on that line. If a
line contains only a number, the percentage is updated with that
number.
Wow, really? It can't be that ridiculous, can it?
I found the source for zenity, downloaded it, extracted it (with tar -xf zenity-3.42.1.tar.xz), opened progress.c, and found the function that checks to see if "a line contains only a number." The function is called only if the first character in the line is a number.
108 static float
109 stof(const char* s) {
110 float rez = 0, fact = 1;
111 if (*s == '-') {
112 s++;
113 fact = -1;
114 }
115 for (int point_seen = 0; *s; s++) {
116 if (*s == '.' || *s == ',') {
117 point_seen = 1;
118 continue;
119 }
120 int d = *s - '0';
121 if (d >= 0 && d <= 9) {
122 if (point_seen) fact /= 10.0f;
123 rez = rez * 10.0f + (float)d;
124 }
125 }
126 return rez * fact;
127 }
Do you see it yet? Here, I'll give you a sscce, with comments:
// Clear the "found a decimal point" flag and iterate
// through the input in `s`.
115 for (int point_seen = 0; *s; s++) {
// If the next char is a decimal point (or a comma,
// for Europeans), set the "found it" flag and check
// the next character.
116 if (*s == '.' || *s == ',') {
117 point_seen = 1;
118 continue;
119 }
// Sneaky C trick that converts a numeric character
// to its integer value. Ex: char '1' becomes int 1.
120 int d = *s - '0';
// We only care if it's actually an integer; skip anything else.
121 if (d >= 0 && d <= 9) {
// If we saw a decimal point, we're looking at tenths,
// hundredths, thousandths, etc., so we'll need to adjust
// the final result. (Note from the peanut gallery: this is
// just ridiculous. A progress bar doesn't need to be this
// accurate. Just quit at the first decimal point instead
// of trying to be "clever.")
122 if (point_seen) fact /= 10.0f;
// Tack the new digit onto the end of the "rez"ult.
// Ex: if rez = 12.0 and d = 5, this is 12.0 * 10.0 + 5. = 125.
123 rez = rez * 10.0f + (float)d;
124 }
125 }
// We've scanned the entire line, so adjust the result to account
// for the decimal point and return the number.
126 return rez * fact;
Now do you see it?
The author decides "[i]f a line contains only a number" by checking (only!) that the first character is a number. If it is, then it plucks out all the digits (and the first decimal, if there is one), mashes them all together, and returns whatever it found, ignoring anything else it may have seen.
So of course it failed if there were 3 digits and the first wasn't 0, or if there were 4 digits and the first 2 weren't 0... because a 3-digit number is always at least 100, and zenity will --auto-close as soon as the progress is 100 or higher.
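If you want to see it with your own eyes, here is a throwaway harness (mine, not part of zenity) that pastes the stof() listing from above and feeds it the first data line of the ping output. The exact value printed may wobble with float rounding, but it lands far above 100:
#include <stdio.h>

/* stof() copied from zenity's progress.c, as listed above */
static float stof(const char *s) {
    float rez = 0, fact = 1;
    if (*s == '-') {
        s++;
        fact = -1;
    }
    for (int point_seen = 0; *s; s++) {
        if (*s == '.' || *s == ',') {
            point_seen = 1;
            continue;
        }
        int d = *s - '0';
        if (d >= 0 && d <= 9) {
            if (point_seen) fact /= 10.0f;
            rez = rez * 10.0f + (float)d;
        }
    }
    return rez * fact;
}

int main(void) {
    /* The first data line of the ping output; it starts with '6', so
       zenity treats it as "a line containing only a number". */
    const char *line = "64 bytes from bh-in-f82.1e100.net (172.253.122.82): "
                       "icmp_seq=1 ttl=59 time=18.5 ms";
    printf("%f\n", stof(line)); /* the digits mash together to roughly 6482.1 */
    return 0;
}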
Spoiler:
The ping statement generates output that confuses zenity into thinking the progress has reached 100%, so it closes the dialog.
By the way, congratulations: you found one of the rookiest kinds of rookie mistakes a programmer can make... and it's not your bug! For whatever reason, the author of zenity decided to roll their own function to convert a line of text to a floating-point number, and it doesn't do at all what the doc says, or what any normal person would expect it to do. (Protip: libraries will do this for you, and they'll actually work most of the time.)
You can score a bunch of karma points if you can figure out how to report the bug, and you'll get a bonus if you submit your report in the form of a fix. 😀

How to eliminate JIT overhead in a Julia executable (with MWE)

I'm using PackageCompiler hoping to create an executable that eliminates just-in-time compilation overhead.
The documentation explains that I must define a function julia_main to call my program's logic, and write a "snoop file", a script that calls functions I wish to precompile. My julia_main takes a single argument, the location of a file containing the input data to be analysed. So to keep things simple my snoop file simply makes one call to julia_main with a particular input file. So I'd hope to see the generated executable run nice and fast (no compilation overhead) when executed against that same input file.
But alas, that's not what I see. In a fresh Julia instance julia_main takes approx 74 seconds for the first execution and about 4.5 seconds for subsequent executions. The executable file takes approx 50 seconds each time it's called.
My use of the build_executable function looks like this:
julia> using PackageCompiler
julia> build_executable("d:/philip/source/script/julia/jsource/SCRiPTMain.jl",
"testexecutable",
builddir = "d:/temp/builddir4",
snoopfile = "d:/philip/source/script/julia/jsource/snoop.jl",
compile = "all",
verbose = true)
Questions:
Are the above arguments correct to achieve my aim of an executable with no JIT overhead?
Any other advice for me?
Here's what happens in response to that call to build_executable. The lines from Start of snoop file execution! to End of snoop file execution! are emitted by my code.
Julia program file:
"d:\philip\source\script\julia\jsource\SCRiPTMain.jl"
C program file:
"C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c"
Build directory:
"d:\temp\builddir4"
Executing snoopfile: "d:\philip\source\script\julia\jsource\snoop.jl"
Start of snoop file execution!
┌ Warning: The 'control file' contains the key 'InterpolateCovariance' with value 'true' but that is not supported. Pass a value of 'false' or omit the key altogether.
└ @ ValidateInputs d:\Philip\Source\script\Julia\JSource\ValidateInputs.jl:685
Time to build model 20.058000087738037
Saving c:/temp/SCRiPT/SCRiPTModel.jls
Results written to c:/temp/SCRiPT/SCRiPTResultsJulia.json
Time to write file: 3620 milliseconds
Time in method runscript: 76899 milliseconds
End of snoop file execution!
[ Info: used 1313 out of 1320 precompile statements
Build static library "testexecutable.a":
atexit_hook_copy = copy(Base.atexit_hooks) # make backup
# clean state so that any package we use can carelessly call atexit
empty!(Base.atexit_hooks)
Base.__init__()
Sys.__init__() #fix https://github.com/JuliaLang/julia/issues/30479
using REPL
Base.REPL_MODULE_REF[] = REPL
Mod = @eval module $(gensym("anon_module")) end
# Include into anonymous module to not polute namespace
Mod.include("d:\\\\temp\\\\builddir4\\\\julia_main.jl")
Base._atexit() # run all exit hooks we registered during precompile
empty!(Base.atexit_hooks) # don't serialize the exit hooks we run + added
# atexit_hook_copy should be empty, but who knows what base will do in the future
append!(Base.atexit_hooks, atexit_hook_copy)
Build shared library "testexecutable.dll":
`'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root\mingw\bin\gcc.exe' --sysroot 'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root' -shared '-DJULIAC_PROGRAM_LIBNAME="testexecutable.dll"' -o testexecutable.dll -Wl,--whole-archive testexecutable.a -Wl,--no-whole-archive -std=gnu99 '-IC:\Users\philip\AppData\Local\Julia-1.2.0\include\julia' -DJULIA_ENABLE_THREADING=1 '-LC:\Users\philip\AppData\Local\Julia-1.2.0\bin' -Wl,--stack,8388608 -ljulia -lopenlibm -m64 -Wl,--export-all-symbols`
Build executable "testexecutable.exe":
`'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root\mingw\bin\gcc.exe' --sysroot 'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root' '-DJULIAC_PROGRAM_LIBNAME="testexecutable.dll"' -o testexecutable.exe 'C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c' testexecutable.dll -std=gnu99 '-IC:\Users\philip\AppData\Local\Julia-1.2.0\include\julia' -DJULIA_ENABLE_THREADING=1 '-LC:\Users\philip\AppData\Local\Julia-1.2.0\bin' -Wl,--stack,8388608 -ljulia -lopenlibm -m64`
Copy Julia libraries to build directory:
7z.dll
BugpointPasses.dll
libamd.2.4.6.dll
libamd.2.dll
libamd.dll
libatomic-1.dll
libbtf.1.2.6.dll
libbtf.1.dll
libbtf.dll
libcamd.2.4.6.dll
libcamd.2.dll
libcamd.dll
libccalltest.dll
libccolamd.2.9.6.dll
libccolamd.2.dll
libccolamd.dll
libcholmod.3.0.13.dll
libcholmod.3.dll
libcholmod.dll
libclang.dll
libcolamd.2.9.6.dll
libcolamd.2.dll
libcolamd.dll
libdSFMT.dll
libexpat-1.dll
libgcc_s_seh-1.dll
libgfortran-4.dll
libgit2.dll
libgmp.dll
libjulia.dll
libklu.1.3.8.dll
libklu.1.dll
libklu.dll
libldl.2.2.6.dll
libldl.2.dll
libldl.dll
libllvmcalltest.dll
libmbedcrypto.dll
libmbedtls.dll
libmbedx509.dll
libmpfr.dll
libopenblas64_.dll
libopenlibm.dll
libpcre2-8-0.dll
libpcre2-8.dll
libpcre2-posix-2.dll
libquadmath-0.dll
librbio.2.2.6.dll
librbio.2.dll
librbio.dll
libspqr.2.0.9.dll
libspqr.2.dll
libspqr.dll
libssh2.dll
libssp-0.dll
libstdc++-6.dll
libsuitesparseconfig.5.4.0.dll
libsuitesparseconfig.5.dll
libsuitesparseconfig.dll
libsuitesparse_wrapper.dll
libumfpack.5.7.8.dll
libumfpack.5.dll
libumfpack.dll
libuv-2.dll
libwinpthread-1.dll
LLVM.dll
LLVMHello.dll
zlib1.dll
All done
julia>
EDIT
I was afraid that creating a minimal working example would be hard, but it was straightforward:
TestBuildExecutable.jl contains:
module TestBuildExecutable

Base.@ccallable function julia_main(ARGS::Vector{String}=[""])::Cint
    @show sum(myarray())
    return 0
end

# Function which takes approx 8 seconds to compile. Returns a 500 x 20 array of 1s.
function myarray()
    [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
     # PLEASE EDIT TO INSERT THE MISSING 496 LINES, EACH IDENTICAL TO THE LINE ABOVE!
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
end

end #module
SnoopFile.jl contains:
module SnoopFile

currentpath = dirname(@__FILE__)
push!(LOAD_PATH, currentpath)
unique!(LOAD_PATH)

using TestBuildExecutable

println("Start of snoop file execution!")
TestBuildExecutable.julia_main()
println("End of snoop file execution!")

end # module
In a fresh Julia instance, julia_main takes 8.3 seconds for the first execution and half a millisecond for the second execution:
julia> @time TestBuildExecutable.julia_main()
sum(myarray()) = 10000
8.355108 seconds (425.36 k allocations: 25.831 MiB, 0.06% gc time)
0
julia> @time TestBuildExecutable.julia_main()
sum(myarray()) = 10000
0.000537 seconds (25 allocations: 82.906 KiB)
0
So next I call build_executable:
julia> using PackageCompiler
julia> build_executable("d:/philip/source/script/julia/jsource/TestBuildExecutable.jl",
"testexecutable",
builddir = "d:/temp/builddir15",
snoopfile = "d:/philip/source/script/julia/jsource/SnoopFile.jl",
verbose = false)
Julia program file:
"d:\philip\source\script\julia\jsource\TestBuildExecutable.jl"
C program file:
"C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c"
Build directory:
"d:\temp\builddir15"
Start of snoop file execution!
sum(myarray()) = 10000
End of snoop file execution!
[ Info: used 79 out of 79 precompile statements
All done
Finally, in a Windows Command Prompt:
D:\temp\builddir15>testexecutable
sum(myarray()) = 10000
D:\temp\builddir15>
which took (by my stopwatch) 8 seconds to run, and it takes 8 seconds to run every time it's executed, not just the first time. This is consistent with the executable doing a JIT compile every time it's run, but the snoop file is designed to avoid that!
Version information:
julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 8
JULIA_EDITOR = "C:\Users\Philip\AppData\Local\Programs\Microsoft VS Code\Code.exe"
Looks like you are using Windows. At some point PackageCompiler.jl will be mature on Windows, at which point you can try it.
The solution was indeed to wait for progress on PackageCompilerX, as suggested by @xiaodai.
On 10 Feb 2020, what was formerly PackageCompilerX became a new (version 1.0 of) PackageCompiler, with a significantly changed API and more thorough documentation.
In particular, the MWE above (adapted to the new PackageCompiler API) now works correctly without any JIT overhead.
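For anyone reading this later, the new workflow looks roughly like the sketch below. This is only an outline with hypothetical paths, assuming MyApp is a package whose root module defines a julia_main() entry point; the precompile_execution_file plays the role the old snoop file did:
using PackageCompiler

# PackageCompiler >= 1.0: compile a package into a standalone app bundle.
# Everything executed by the precompile_execution_file gets compiled
# into the sysimage at build time instead of at each run.
create_app("path/to/MyApp", "MyAppCompiled";
           precompile_execution_file = "path/to/MyApp/precompile_app.jl")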

sched_wakeup_granularity_ns, sched_min_granularity_ns and SCHED_RR

The following values are from my box:
sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 18000000
kernel.sched_migration_cost_ns = 5000000
kernel.sched_min_granularity_ns = 10000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_shares_window_ns = 10000000
kernel.sched_time_avg_ms = 1000
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 3000000
This means that in one second, 0.95 seconds are available for SCHED_FIFO or SCHED_RR, and only 0.05 seconds are reserved for SCHED_OTHER. What I am curious about is sched_wakeup_granularity_ns; I googled it and found this explanation:
Ability of tasks being woken to preempt the current task.
The smaller the value, the easier it is for the task to force the preemption.
I think sched_wakeup_granularity_ns only affects SCHED_OTHER tasks; SCHED_FIFO and SCHED_RR tasks should never be asleep, so there is no need to "wake up".
Am I correct?
And for sched_min_granularity_ns, the explanation is:
Minimum preemption granularity for processor-bound tasks.
Tasks are guaranteed to run for this minimum time before they are preempted.
I would like to know: although SCHED_RR tasks can have 95% of the CPU time, since sched_min_granularity_ns = 10000000, i.e. 0.01 seconds, every SCHED_OTHER task gets a 0.01-second timeslice before it is preempted, unless it blocks (on a socket, in a sleep, or elsewhere). Does that imply that if I have 3 tasks on core 1, say 2 tasks with SCHED_RR and a third with SCHED_OTHER, and the third task runs an endless loop with no blocking socket recv and no yield, then once the third task gets the CPU it will run for 0.01 seconds and only then be context-switched out, even if the next task is SCHED_RR?
Is that the right understanding of sched_min_granularity_ns?
Edit :
http://lists.pdxlinux.org/pipermail/plug/2006-February/045495.html
describes:
No SCHED_OTHER process may be preempted by another SCHED_OTHER process.
However a SCHED_RR or SCHED_FIFO process will preempt SCHED_OTHER
process before their time slice is done. So a SCHED_RR process
should wake up from a sleep with fairly good accuracy.
Does this mean a SCHED_RR task can preempt the endless, never-blocking loop even before its timeslice is done?
Tasks with a higher scheduling class "priority" will preempt all tasks with a lower priority scheduling class, regardless of any timeouts. Take a look at the below snippet from kernel/sched/core.c:
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
    const struct sched_class *class;

    if (p->sched_class == rq->curr->sched_class) {
        rq->curr->sched_class->check_preempt_curr(rq, p, flags);
    } else {
        for_each_class(class) {
            if (class == rq->curr->sched_class)
                break;
            if (class == p->sched_class) {
                resched_curr(rq);
                break;
            }
        }
    }

    /*
     * A queue event has occurred, and we're going to schedule. In
     * this case, we can save a useless back to back clock update.
     */
    if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
        rq_clock_skip_update(rq, true);
}
for_each_class will return the classes in this order: stop, deadline, rt, fair, idle. The loop will stop when trying to preempt a task with the same scheduling class as the preempting task.
So for your question, the answer is yes, an "rt" task will preempt a "fair" task.
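If you want to observe this yourself, here is a minimal sketch (not from the question) that puts the calling process into SCHED_RR; pin it to the same core as a plain SCHED_OTHER busy loop (e.g. with taskset) and the SCHED_RR task will take the CPU immediately whenever it is runnable:
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void) {
    /* Put ourselves into SCHED_RR at realtime priority 10.
       Requires root or CAP_SYS_NICE. */
    struct sched_param sp = { .sched_priority = 10 };
    if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return 1;
    }
    /* Spin forever; per sched_rt_runtime_us, the remaining 5% of each
       second is all a SCHED_OTHER task on this core will get. */
    for (;;)
        ;
}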

Garbage collector in Ruby 2.2 provokes unexpected CoW

How do I prevent the GC from provoking copy-on-write when I fork my process? I have recently been analyzing the garbage collector's behavior in Ruby, due to some memory issues that I encountered in my program (I run out of memory on my 60-core, 0.5 TB machine even for fairly small tasks). For me this really limits the usefulness of Ruby for running programs on multicore servers. I would like to present my experiments and results here.
The issue arises when the garbage collector runs during forking. I have investigated three cases that illustrate the issue.
Case 1: We allocate a lot of objects (strings no longer than 20 bytes) in the memory using an array. The strings are created using a random number and string formatting. When the process forks and we force the GC to run in the child, all the shared memory goes private, causing a duplication of the initial memory.
Case 2: We allocate a lot of objects (strings) in the memory using an array, but the string is created using the rand.to_s function, hence we remove the formatting of the data compared to the previous case. We end up with a smaller amount of memory being used, presumably due to less garbage. When the process forks and we force the GC to run in the child, only part of the memory goes private. We have a duplication of the initial memory, but to a smaller extent.
Case 3: We allocate fewer objects compared to before, but the objects are bigger, such that the amount of memory allocated stays the same as in the previous cases. When the process forks and we force the GC to run in the child all the memory stays shared, i.e. no memory duplication.
Here I paste the Ruby code that has been used for these experiments. To switch between cases you only need to change the "option" value in the memory_object function. The code was tested using Ruby 2.2.2, 2.2.1, 2.1.3, 2.1.5 and 1.9.3 on an Ubuntu 14.04 machine.
Sample output for case 1:
ruby version 2.2.2
proces pid log priv_dirty shared_dirty
Parent 3897 post alloc 38 0
Parent 3897 4 fork 0 37
Child 3937 4 initial 0 37
Child 3937 8 empty GC 35 5
The exact same code has been written in Python and in all cases the CoW works perfectly fine.
Sample output for case 1:
python version 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
proces pid log priv_dirty shared_dirty
Parent 4308 post alloc 35 0
Parent 4308 4 fork 0 35
Child 4309 4 initial 0 35
Child 4309 10 empty GC 1 34
Ruby code
$start_time = Time.new

# Monitor use of Resident and Virtual memory.
class Memory

  shared_dirty = '.+?Shared_Dirty:\s+(\d+)'
  priv_dirty = '.+?Private_Dirty:\s+(\d+)'
  MEM_REGEXP = /#{shared_dirty}#{priv_dirty}/m

  # get memory usage
  def self.get_memory_map( pids)
    memory_map = {}
    memory_map[ :pids_found] = {}
    memory_map[ :shared_dirty] = 0
    memory_map[ :priv_dirty] = 0
    pids.each do |pid|
      begin
        lines = nil
        lines = File.read( "/proc/#{pid}/smaps")
      rescue
        lines = nil
      end
      if lines
        lines.scan(MEM_REGEXP) do |shared_dirty, priv_dirty|
          memory_map[ :pids_found][pid] = true
          memory_map[ :shared_dirty] += shared_dirty.to_i
          memory_map[ :priv_dirty] += priv_dirty.to_i
        end
      end
    end
    memory_map[ :pids_found] = memory_map[ :pids_found].keys
    return memory_map
  end

  # get the processes and get the value of the memory usage
  def self.memory_usage( )
    pids = [ $$]
    result = self.get_memory_map( pids)
    result[ :pids] = pids
    return result
  end

  # print the values of the private and shared memories
  def self.log( process_name='', log_tag="")
    if process_name == "header"
      puts " %-6s %5s %-12s %10s %10s\n" % ["proces", "pid", "log", "priv_dirty", "shared_dirty"]
    else
      time = Time.new - $start_time
      mem = Memory.memory_usage( )
      puts " %-6s %5d %-12s %10d %10d\n" % [process_name, $$, log_tag, mem[:priv_dirty]/1000, mem[:shared_dirty]/1000]
    end
  end
end

# function to delay the processes a bit
def time_step( n)
  while Time.new - $start_time < n
    sleep( 0.01)
  end
end

# create an object of specified size. The option argument can be changed from 0 to 2 to visualize the behavior of the GC in various cases
#
# case 0 (default) : we make a huge array of small objects by formatting a string
# case 1 : we make a huge array of small objects without formatting a string (we use the to_s function)
# case 2 : we make a smaller array of big objects
def memory_object( size, option=1)
  result = []
  count = size/20

  if option > 3 or option < 1
    count.times do
      result << "%20.18f" % rand
    end
  elsif option == 1
    count.times do
      result << rand.to_s
    end
  elsif option == 2
    count = count/10
    count.times do
      result << ("%20.18f" % rand)*30
    end
  end

  return result
end

##### main #####

puts "ruby version #{RUBY_VERSION}"

GC.disable

# print the column headers and first line
Memory.log( "header")

# Allocation of memory
big_memory = memory_object( 1000 * 1000 * 10)

Memory.log( "Parent", "post alloc")

lab_time = Time.new - $start_time
if lab_time < 3.9
  lab_time = 0
end

# start the forking
pid = fork do
  time = 4
  time_step( time + lab_time)
  Memory.log( "Child", "#{time} initial")

  # force GC when nothing happened
  GC.enable; GC.start; GC.disable

  time = 8
  time_step( time + lab_time)
  Memory.log( "Child", "#{time} empty GC")

  sleep( 1)
  STDOUT.flush
  exit!
end

time = 4
time_step( time + lab_time)
Memory.log( "Parent", "#{time} fork")

# wait for the child to finish
Process.wait( pid)
Python code
import re
import time
import os
import random
import sys
import gc

start_time = time.time()

# Monitor use of Resident and Virtual memory.
class Memory:

    def __init__(self):
        self.shared_dirty = '.+?Shared_Dirty:\s+(\d+)'
        self.priv_dirty = '.+?Private_Dirty:\s+(\d+)'
        self.MEM_REGEXP = re.compile("{shared_dirty}{priv_dirty}".format(shared_dirty=self.shared_dirty, priv_dirty=self.priv_dirty), re.DOTALL)

    # get memory usage
    def get_memory_map(self, pids):
        memory_map = {}
        memory_map[ "pids_found" ] = {}
        memory_map[ "shared_dirty" ] = 0
        memory_map[ "priv_dirty" ] = 0
        for pid in pids:
            try:
                lines = None
                with open( "/proc/{pid}/smaps".format(pid=pid), "r" ) as infile:
                    lines = infile.read()
            except:
                lines = None
            if lines:
                for shared_dirty, priv_dirty in re.findall( self.MEM_REGEXP, lines ):
                    memory_map[ "pids_found" ][pid] = True
                    memory_map[ "shared_dirty" ] += int( shared_dirty )
                    memory_map[ "priv_dirty" ] += int( priv_dirty )
        memory_map[ "pids_found" ] = memory_map[ "pids_found" ].keys()
        return memory_map

    # get the processes and get the value of the memory usage
    def memory_usage( self):
        pids = [ os.getpid() ]
        result = self.get_memory_map( pids)
        result[ "pids" ] = pids
        return result

    # print the values of the private and shared memories
    def log( self, process_name='', log_tag=""):
        if process_name == "header":
            print " %-6s %5s %-12s %10s %10s" % ("proces", "pid", "log", "priv_dirty", "shared_dirty")
        else:
            global start_time
            Time = time.time() - start_time
            mem = self.memory_usage( )
            print " %-6s %5d %-12s %10d %10d" % (process_name, os.getpid(), log_tag, mem["priv_dirty"]/1000, mem["shared_dirty"]/1000)

# function to delay the processes a bit
def time_step( n):
    global start_time
    while (time.time() - start_time) < n:
        time.sleep( 0.01)

# create an object of specified size. The option argument can be changed from 0 to 2 to visualize the behavior of the GC in various cases
#
# case 0 (default) : we make a huge array of small objects by formatting a string
# case 1 : we make a huge array of small objects without formatting a string (we use the to_s function)
# case 2 : we make a smaller array of big objects
def memory_object( size, option=2):
    count = size/20
    if option > 3 or option < 1:
        result = [ "%20.18f"% random.random() for i in xrange(count) ]
    elif option == 1:
        result = [ str( random.random() ) for i in xrange(count) ]
    elif option == 2:
        count = count/10
        result = [ ("%20.18f"% random.random())*30 for i in xrange(count) ]
    return result

##### main #####

print "python version {version}".format(version=sys.version)

memory = Memory()
gc.disable()

# print the column headers and first line
memory.log( "header") # Print the headers of the columns

# Allocation of memory
big_memory = memory_object( 1000 * 1000 * 10) # Allocate memory

memory.log( "Parent", "post alloc")

lab_time = time.time() - start_time
if lab_time < 3.9:
    lab_time = 0

# start the forking
pid = os.fork() # fork the process
if pid == 0:
    Time = 4
    time_step( Time + lab_time)
    memory.log( "Child", "{time} initial".format(time=Time))

    # force GC when nothing happened
    gc.enable(); gc.collect(); gc.disable();

    Time = 10
    time_step( Time + lab_time)
    memory.log( "Child", "{time} empty GC".format(time=Time))

    time.sleep( 1)
    sys.exit(0)

Time = 4
time_step( Time + lab_time)
memory.log( "Parent", "{time} fork".format(time=Time))

# Wait for child process to finish
os.waitpid( pid, 0)
EDIT
Indeed, calling the GC several times before forking the process solves the issue and I am quite surprised. I have also run the code using Ruby 2.0.0 and the issue doesn't even appear, so it must be related to this generational GC just like you mentioned.
However, if I call the memory_object function without assigning the output to any variables (I am only creating garbage), then the memory is duplicated. The amount of memory that is copied depends on the amount of garbage that I create - the more garbage, the more memory becomes private.
Any ideas how I can prevent this?
Here are some results
Running the GC in 2.0.0
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3664 post alloc 67 0
Parent 3664 4 fork 1 69
Child 3700 4 initial 1 69
Child 3700 8 empty GC 6 65
Calling memory_object( 1000*1000) in the child
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3703 post alloc 67 0
Parent 3703 4 fork 1 70
Child 3739 4 initial 1 70
Child 3739 8 empty GC 15 56
Calling memory_object( 1000*1000*10)
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3743 post alloc 67 0
Parent 3743 4 fork 1 69
Child 3779 4 initial 1 69
Child 3779 8 empty GC 89 5
UPD2
Suddenly I figured out why all the memory goes private if you format the string: you generate garbage during formatting while GC is disabled, then enable GC, and you end up with holes of released objects inside your generated data. Then you fork, new garbage starts to occupy these holes, and the more garbage there is, the more pages go private.
So I added a cleanup function that runs the GC every 2000 iterations (just enabling lazy GC didn't help):
count.times do |i|
  cleanup(i)
  result << "%20.18f" % rand
end
#......snip........#
def cleanup(i)
  if ((i%2000).zero?)
    GC.enable; GC.start; GC.disable
  end
end
##### main #####
This resulted in (with memory_object( 1000 * 1000 * 10) generated after the fork):
RUBY_GC_HEAP_INIT_SLOTS=600000 ruby gc-test.rb 0
ruby version 2.2.0
proces pid log priv_dirty shared_dirty
Parent 2501 post alloc 35 0
Parent 2501 4 fork 0 35
Child 2503 4 initial 0 35
Child 2503 8 empty GC 28 22
Yes, it affects performance, but only before forking, i.e. it increases load time in your case.
UPD1
Just found the criterion by which Ruby 2.2 sets old-object bits: it's 3 GCs. So if you add the following before forking:
GC.enable; 3.times {GC.start}; GC.disable
# start the forking
you will get (with option 1 on the command line):
$ RUBY_GC_HEAP_INIT_SLOTS=600000 ruby gc-test.rb 1
ruby version 2.2.0
proces pid log priv_dirty shared_dirty
Parent 2368 post alloc 31 0
Parent 2368 4 fork 1 34
Child 2370 4 initial 1 34
Child 2370 8 empty GC 2 32
But this needs further testing of how such objects behave on future GCs; at least after 100 GCs :old_objects remains constant, so I suppose it should be OK.
Log with GC.stat is here
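If you want to watch the promotion happen yourself, the :old_objects counter can be read straight out of GC.stat; a small hedged snippet, not part of the test script above:
GC.enable
4.times do
  GC.start
  # after the third GC the surviving data should show up as old
  puts GC.stat[:old_objects]
end
GC.disable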
By the way, there's also the option RGENGC_OLD_NEWOBJ_CHECK to create old objects from the beginning. I doubt that's a good idea in general, but it may be useful for a particular case.
First answer
My proposition in the comment above was wrong; actually the bitmap tables are the savior.
(option = 1)
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 14807 post alloc 27 0
Parent 14807 4 fork 0 27
Child 14809 4 initial 0 27
Child 14809 8 empty GC 6 25 # << almost everything stays shared <<
I also got hold of Ruby Enterprise Edition and tested it by hand; it's only about half as bad as the worst cases.
ruby version 1.8.7
proces pid log priv_dirty shared_dirty
Parent 15064 post alloc 86 0
Parent 15064 4 fork 2 84
Child 15065 4 initial 2 84
Child 15065 8 empty GC 40 46
(I made the script run strictly 1 GC, by increasing RUBY_GC_HEAP_INIT_SLOTS to 600k)

Is it faster to give Perl's print a list or a concatenated string?

option A:
print $fh $hr->{'something'}, "|", $hr->{'somethingelse'}, "\n";
option B:
print $fh $hr->{'something'} . "|" . $hr->{'somethingelse'} . "\n";
The answer is simple, it doesn't matter. As many folks have pointed out, this is not going to be your program's bottleneck. Optimizing this to even happen instantly is unlikely to have any effect on your performance. You must profile first, otherwise you are just guessing and wasting your time.
If we are going to waste time on it, let's at least do it right. Below is the code to do a realistic benchmark. It actually does the print and sends the benchmarking information to STDERR. You run it as perl benchmark.plx > /dev/null to keep the output from flooding your screen.
Here's 5 million iterations writing to STDOUT. By using both timethese() and cmpthese() we get all the benchmarking data.
$ perl ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.84 usr + 0.12 sys = 3.96 CPU) @ 1262626.26/s (n=5000000)
list: 4 wallclock secs ( 3.57 usr + 0.12 sys = 3.69 CPU) @ 1355013.55/s (n=5000000)
Rate concat list
concat 1262626/s -- -7%
list 1355014/s 7% --
And here's 5 million writing to a temp file
$ perl ~/tmp/bench.plx 5000000
Benchmark: timing 5000000 iterations of concat, list...
concat: 6 wallclock secs ( 3.94 usr + 1.05 sys = 4.99 CPU) @ 1002004.01/s (n=5000000)
list: 7 wallclock secs ( 3.64 usr + 1.06 sys = 4.70 CPU) @ 1063829.79/s (n=5000000)
Rate concat list
concat 1002004/s -- -6%
list 1063830/s 6% --
Note the extra wallclock and sys time underscoring how what you're printing to matters as much as what you're printing.
The list version is about 5% faster (note this is counter to Pavel's logic underlining the futility of trying to just think this stuff through). You said you're doing tens of thousands of these? Let's see... 100k takes 146ms of wallclock time on my laptop (which has crappy I/O) so the best you can do here is to shave off about 7ms. Congratulations. If you spent even a minute thinking about this it will take you 40k iterations of that code before you've made up that time. This is not to mention the opportunity cost, in that minute you could have been optimizing something far more important.
Now, somebody's going to say "now that we know which way is faster we should write it the fast way and save that time in every program we write making the whole exercise worthwhile!" No. It will still add up to an insignificant portion of your program's run time, far less than the 5% you get measuring a single statement. Second, logic like that causes you to prioritize micro-optimizations over maintainability.
Oh, and it's different in 5.8.8 than in 5.10.0.
$ perl5.8.8 ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.69 usr + 0.04 sys = 3.73 CPU) @ 1340482.57/s (n=5000000)
list: 5 wallclock secs ( 3.97 usr + 0.06 sys = 4.03 CPU) @ 1240694.79/s (n=5000000)
Rate list concat
list 1240695/s -- -7%
concat 1340483/s 8% --
It might even change depending on what Perl I/O layer you're using and operating system. So the whole exercise is futile.
Micro-optimization is a fool's game. Always profile first and look to optimizing your algorithm. Devel::NYTProf is an excellent profiler.
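Running it is a two-step affair; assuming the module is installed from CPAN, something like:
perl -d:NYTProf yourscript.pl   # writes ./nytprof.out
nytprofhtml                     # renders it as an HTML report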
#!/usr/bin/perl -w

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

#open my $fh, ">", "/tmp/test.out" or die $!;
#open my $fh, ">", "/dev/null" or die $!;
my $fh = *STDOUT;

my $hash = {
    foo => "something and stuff",
    bar => "and some other stuff"
};

select *STDERR;

my $r = timethese(shift || -3, {
    list => sub {
        print $fh $hash->{foo}, "|", $hash->{bar};
    },
    concat => sub {
        print $fh $hash->{foo}. "|". $hash->{bar};
    },
});

cmpthese($r);
Unless you are executing millions of these statements, the performance difference will not matter. I really suggest concentrating on performance problems where they do exist - and the only way to find that out is to profile your application.
Premature optimization is something that Joel and Jeff had a podcast on, and whined about, for years. It's just a waste of time to try to optimize something until you KNOW that it's slow.
Perl is a high-level language, and as such the statements you see in the source code don't map directly to what the computer is actually going to do. You might find that a particular implementation of perl makes one thing faster than the other, but that's no guarantee that another implementation might take away the advantage (although they try not to make things slower).
If you're worried about I/O speed, there are a lot more interesting and useful things to tweak before you start worrying about commas and periods. See, for instance, the discussion under Perl write speed mystery.
UPDATE:
I just ran my own test.
1,000,000 iterations of each version took < 1 second each.
10 million iterations of each version took an average of 2.35 seconds for the list version vs. 2.1 seconds for the string-concat version.
Have you actually tried profiling this? Only takes a few seconds.
On my machine, it appears that B is faster. However, you should really have a look at Pareto analysis. You've already wasted far, far more time thinking about this question than you'd ever save in any program run. For a problem as trivial as this (character substitution!), you should wait to care until you actually have a problem.
Of the three options, I would probably choose string interpolation first and switch to commas for expressions that cannot be interpolated. This, humorously enough, means that my default choice is the slowest of the bunch, but given that they are all so close to each other in speed and that disk speed is probably going to be slower than anything else, I don't believe changing the method has any real performance benefits.
As others have said, write the code, then profile the code, then examine the algorithms and data structures you have chosen that are in the slow parts of the code, and, finally, look at the implementation of the algorithms and data structures. Anything else is foolish micro-optimizing that wastes more time than it saves.
You may also want to read perldoc perlperf
Rate string concat comma
string 803887/s -- -0% -7%
concat 803888/s 0% -- -7%
comma 865570/s 8% 8% --
#!/usr/bin/perl

use strict;
use warnings;

use Carp;
use List::Util qw/first/;
use Benchmark;

sub benchmark {
    my $subs = shift;

    my ($k, $sub) = each %$subs;
    my $value = $sub->();
    croak "bad" if first { $value ne $_->() and print "$value\n", $_->(), "\n" } values %$subs;

    Benchmark::cmpthese -1, $subs;
}

sub fake_print {
    # this, plus writing the output to the screen, is what print does
    no warnings;
    my $output = join $,, @_;
    return $output;
}

my ($x, $y) = ("a", "b");

benchmark {
    comma  => sub { return fake_print $x, "|", $y, "\n" },
    concat => sub { return fake_print $x . "|" . $y . "\n" },
    string => sub { return fake_print "$x|$y\n" },
};
