High frequency calls to 'VM Periodic Task Thread' - performance

Running a small jetty application on a raspberry pi I noticed that after the first access, the application keeps burning around 3% CPU. A quick inspection showed that the same is true, with less %, on my laptop. Checking with strace I find a never ending sequence of
...
12:58:01.999717 clock_gettime(CLOCK_MONOTONIC, {2923, 200177551}) = 0
12:58:01.999864 futex(0x693a0f44, FUTEX_WAIT_BITSET_PRIVATE, 1, {2923, 250177551}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
12:58:02.050090 futex(0x693a0f28, FUTEX_WAKE_PRIVATE, 1) = 0
12:58:02.050236 gettimeofday({1436093882, 50296}, NULL) = 0
12:58:02.050403 gettimeofday({1436093882, 50444}, NULL) = 0
12:58:02.050767 clock_gettime(CLOCK_MONOTONIC, {2923, 251228114}) = 0
...
(This is Java 7 on ubuntu 14.04 with Jetty 9.3.* using an h2 db, just in case this rings any bells for someone.)
I learned that it suffices to capture strace -f -tt -p <pid> -o out.txt, grep for clock_gettime, extract the pid, sort and uniq -c to find the thread calling clock_gettime most often. Plotting the delta times nicely shows a line at 50 milliseconds. Further the PID can be found in a thread dump taken with jvisualvm as the nid in hex and turns out to be 'VM Periodic Task Thread'. But why so often? This does not seem to be a standard behaviour of the JVM.

Related

Disable timer interrupt on Linux kernel

I want to disable the timer interrupt on some of the cores (1-2) on my machine which is a x86 running centos 7 with rt patch, both cores are isolated cores with nohz_full, (you can see the cmdline) but timer interrupt continues to interrupt the real time process which are running on core1 and core2.
1. uname -r
3.10.0-693.11.1.rt56.632.el7.x86_64
2. cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.1.rt56.632.el7.x86_64 \
root=/dev/mapper/centos-root ro crashkernel=auto \
rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet \
default_hugepagesz=2M hugepagesz=2M hugepages=1024 \
intel_iommu=on isolcpus=1-2 irqaffinity=0 intel_idle.max_cstate=0 \
processor.max_cstate=0 idle=mwait tsc=perfect rcu_nocbs=1-2 rcu_nocb_poll \
nohz_full=1-2 nmi_watchdog=0
3. cat /proc/interrupts
CPU0 CPU1 CPU2
0: 29 0 0 IO-APIC-edge timer
.....
......
NMI: 0 0 0 Non-maskable interrupts
LOC: 835205157 308723100 308384525 Local timer interrupts
SPU: 0 0 0 Spurious interrupts
PMI: 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 IRQ work interrupts
RTR: 0 0 0 APIC ICR read retries
RES: 347330843 309191325 308417790 Rescheduling interrupts
CAL: 0 935 935 Function call interrupts
TLB: 320 22 58 TLB shootdowns
TRM: 0 0 0 Thermal event interrupts
THR: 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 Machine check exceptions
MCP: 2 2 2 Machine check polls
CPUs/Clocksource:
4. lscpu | grep CPU.s
CPU(s): 3
On-line CPU(s) list: 0-2
NUMA node0 CPU(s): 0-2
5. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Thanks a lot for any help.
Moses
Even with nohz_full= you get some ticks on the isolated CPUs:
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(Documentation/timers/NO_HZ.txt, cf. (Nearly) full tickless operation in 3.10 LWN, 2013)
Thus, you have to check the rate of the local timer, e.g.:
$ perf stat -a -A -e irq_vectors:local_timer_entry sleep 120
(while your isolated threads/processes are running)
Also, nohz_full= is only effective if there is just one runnable task on each isolated core. You can check that with e.g. ps -L -e -o pid,tid,user,state,psr,cmd and cat /proc/sched_debug.
Perhaps you need to move some (kernel) tasks to you house-keeping core, e.g.:
# tuna -U -t '*' -c 0-4 -m
You can get more insights into what timers are still active by looking at /proc/timer_list.
Another method to investigate causes for possible interruption is to use the functional tracer (ftrace). See also Reducing OS jitter due to per-cpu kthreads for some examples.
I see nmi_watchdog=0 in your kernel parameters, but you don't disable the soft watchdog. Perhaps this is another timer tick source that would show up with ftrace.
You can disable all watchdogs with nowatchdog.
Btw, some of your kernel parameters seem to be off:
tsc=perfect - do you mean tsc=reliable? The 'perfect' value isn't documented in the kernel docs
idle=mwait - do you mean idle=poll? Again, I can't find the 'mwait' value in the kernel docs
intel_iommu=on - what's the purpose of this?

How to share memory in cluster machine (qsub openmpi)

dear all!
I have a question about sharing memory in cluster. I am a new to cluster, and fail to solve my problem after trying about several weeks, so I look for help here, any suggestion would be grateful!
I want to use soapdenovo, a software that was used to assemble human genome to assemble my data. However, it failed in one step because shortage of memory (the memory is 512G in my machine). So I turned to cluster machine (which have three big nodes, each node have 512 memory too), and started to learn submit job with qsub. Considering that one node couldn't solve my problem, I googled and found that openmpi may help, but when I running openmpi with demo data, it seemed it only run the command several times. Then I found to use openmpi, the software must include library of openmpi, and I didn't know whether soapdenovo is support openmpi, I had asked the question but the author didn't give me answer yet. Suppose soapdenovo support the openmpi, how should I solve my problem. If it didn't support openmpi, can I use memory in different nodes to run the software?
The problem had tortured my so much, thanks for any help. Following is what had I do and some information about the cluster machine:
Install openmpi and submit the job
1) The script of job:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
2) Submit the job:
qsub -pe orte 60 mpi.qsub
3) The log in ass.err
a) It seemed it run soapdenovo several times according to the log
cat ass.err | grep "Pregraph" | wc -l
60
b) detail information
less ass.err (it seemed it only run soapdenov several times, because when I run it in my machine, it would only output one Pregraph):
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix
In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
and so on
c) information of stdin
cat ass.log:
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.
Host: smp03
PID: 75035
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Information about cluster:
1) qconf -sql
all.q
smp.q
2) qconf -spl
mpi
mpich
orte
zhongxm
3) qconf -sp zhongxm
pe_name zhongxm
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
4) qconf -sq smp.q
qname smp.q
hostlist #smp.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
5) qconf -sq all.q
qname all.q
hostlist #allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 16,[c0219.local=32]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists mobile
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
According to https://hpc.unt.edu/soapdenovo the software doesn't support MPI:
This code is NOT compiled with MPI, and should only be used in parallel on a SINGLE node, via a threaded model.
So, you can't just start the software with mpiexec on cluster to have access to more memory. Cluster machines are connected with non-coherent networks (Ethernet, Infiniband) which are slower than memory bus, and PCs in cluster do not share their memory. Clusters use MPI libraries (OpenMPI or MPICH) to work with network, and all requests between nodes is explicit: program calls MPI_Send in one process and MPI_Recv in other. There are also one-way calls like MPI_Put/MPI_Get to access remote memory (RDMA - remote direct memory access), but this is not the same as local memory.
osgx, thank you for your reply very much and sorry for the delay of this message.
Since I don't major in computer, I think I can't understand some glossary very well, like ELF. So there are some new questions and I list my question as follow, thanks for help advace:
1) When I "ldd SOAPdenovo-63mer", it outputed "not a dynamic executable", did this mean "the code is not complied with MPI" that you mentioned?
2) In short, I can't solve the problem with the cluster, and I have to look for a machine with more than 512G memory?
3) Also, I used another software called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/) that was also failed for shortage of memory, and according to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what "it uses share memory parallelization" mean, did it means it can use memory in cluster, or only memory in a node, and I have to find a machine with enough memory?
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
By the way, this is first time I posted here, I think I should use commit to reply, considering so many words, I use "Answer Your Question".

thin using high cpu and not replying to request

After a timeout occurs on thin, the process keep using high cpu. The only way is to restart it (I let it run more than a day)
this is the output of strace
ruby#localhost:~$ strace -p 17830
Process 17830 attached
brk(0x7cf38000) = 0x7cf38000
brk(0x7d3ac000) = 0x7d3ac000
brk(0x7d655000) = 0x7d655000
brk(0x7de8c000) = 0x7de8c000
brk(0x7e616000) = 0x7e616000
brk(0x7a0c9000) = 0x7a0c9000
.. and one similar line each 3 - 4 seconds
Why this happen? I've seen this also on mongrel. Why if the http request already ended it keeps up?

How to obtain the virtual private memory of a process from the command line under OSX?

I would like to obtain the virtual private memory consumed by a process under OSX from the command line. This is the value that Activity Monitor reports in the "Virtual Mem" column. ps -o vsz reports the total address space available to the process and is therefore not useful.
You can obtain the virtual private memory use of a single process by running
top -l 1 -s 0 -i 1 -stats vprvt -pid PID
where PID is the process ID of the process you are interested in. This results in about a dozen lines of output ending with
VPRVT
55M+
So by parsing the last line of output, one can at least obtain the memory footprint in MB. I tested this on OSX 10.6.8.
update
I realized (after I got downvoted) that #user1389686 gave an answer in the comment section of the OP that was better than my paltry first attempt. What follows is based on user1389686's own answer. I cannot take credit for it -- I've just cleaned it up a bit.
original, edited with -stats vprvt
As Mahmoud Al-Qudsi mentioned, top does what you want. If PID 8631 is the process you want to examine:
$ top -l 1 -s 0 -stats vprvt -pid 8631
Processes: 84 total, 2 running, 82 sleeping, 378 threads
2012/07/14 02:42:05
Load Avg: 0.34, 0.15, 0.04
CPU usage: 15.38% user, 30.76% sys, 53.84% idle
SharedLibs: 4668K resident, 4220K data, 0B linkedit.
MemRegions: 15160 total, 961M resident, 25M private, 520M shared.
PhysMem: 917M wired, 1207M active, 276M inactive, 2400M used, 5790M free.
VM: 171G vsize, 1039M framework vsize, 1523860(0) pageins, 811163(0) pageouts.
Networks: packets: 431147/140M in, 261381/59M out.
Disks: 487900/8547M read, 2784975/40G written.
VPRVT
8631
Here's how I get at this value using a bit of Ruby code:
# Return the virtual memory size of the current process
def virtual_private_memory
s = `top -l 1 -s 0 -stats vprvt -pid #{Process.pid}`.split($/).last
return nil unless s =~ /\A(\d*)([KMG])/
$1.to_i * case $2
when "K"
1000
when "M"
1000000
when "G"
1000000000
else
raise ArgumentError.new("unrecognized multiplier in #{f}")
end
end
Updated answer, thats work under Yosemite, from user1389686:
top -l 1 -s 0 -stats mem -pid PID

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-Node cluster with Xeon processors. Sometimes if I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
linpack starts and prints the output, sometimes I only see the mpi mappings printed and then nothing following. To me this seems like random behaviour because I don't change anything between the calls and as already mentioned, Linpack sometimes starts, sometimes not.
In top I can see that xhpl_intel64processes have been created and they are heavily using the CPU but when watching the traffic between the nodes, iftop is telling me that it nothing is sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
edit2:
I now just let the program run for a while and after 30min it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this a mpi problem?
Do you know what type of problem this could be?
I figured out what the problem was: MPICH2 uses different random ports each time it starts and if these are blocked your application wont start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
Best,
heinrich

Resources