samtools calmd is pretty slow - bioinformatics

I am using "samtools calmd" to add MD tag back to BAM file. The size of original BAM is around 50Gb (whole genome sequence by using pacbio HIFI reads). The issue that I encountered is that the speed of "calmd" is incredibly slow! The jobs have already run 12 hours, and only 600MB BAM with MD tag are generated. In this way, 50GB BAM will take 30days to be finished!
Here is the code I used to add MD tag (very normal):
rule addMDTag:
input:
rules.pbmm2_alignment.output
output:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"
params:
ref = strRef
threads:
16
log:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
benchmark:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
shell:
"samtools calmd -# {threads} {input} {params.ref} -bAr > {output}"
The version of samtools I used is v1.10.
BTW, I use 16 cores to run calmd, however, it looks like the samtools is still using 1 core to run it:
top - 11:44:53 up 47 days, 20:35, 1 user, load average: 2.00, 2.01, 2.00
Tasks: 1723 total, 3 running, 1720 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.3%sy, 0.0%ni, 96.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529329180k total, 232414724k used, 296914456k free, 84016k buffers
Swap: 12582908k total, 74884k used, 12508024k free, 227912476k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
93137 lix33 20 0 954m 151m 2180 R 100.2 0.0 659:04.13 samtools
May I know how to make calmd be much faster? Or is there any other tool that can do the same job more efficiently?
Thanks so much

After the collaboration with samtools maintenance team, this issue has been solved.
The calmd will be super slow if the bam was unsorted. Therefore, always make sure the BAM has been sorted before run calmd.
See the details below:
Are your files name sorted, and does your reference have more than one entry?
If so calmd will be switching between references all the time,
which means it may be doing a lot of reference loading and not much MD calculation.
You may find it goes a lot faster if you position-sort the input, and then run it through calmd.

Related

Calculate md5 on a single 1T file, or on 100 10G files, which one is faster? Or the speed are the same?

I have a huge 1T file on my local machine and one on the remote server. I need to calculate their md5 to check if they are exactly the same. Since it will take long time to calculate md5 from them, I want to do some research on the md5 speed. I can calculate md5 directly against the whole file, or split it into 100 10G files and calculate md5 on them. I want to know which one is faster, or will they have the same speed?
As I was trying to say in the comments, it will depend on lots of things like the speed of your disk subsystem, your CPU performance and so on.
Here is an example. Create a 120GB file and check its size:
dd if=/dev/random of=junk bs=1g count=120
ls -lh junk
-rw-r--r-- 1 mark staff 120G 5 Oct 13:34 junk
Checksum in one go:
time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0 junk
real 3m55.713s <--- 4 minutes
user 3m28.441s
sys 0m24.871s
Checksum in 10GB chunks, with 12 CPU cores in parallel:
time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d -
793f02bb52407415b2bfb752827e3845 -
bf8f724d63f972251c2973c5bc73b68f -
d227dcb00f981012527fdfe12b0a9e0e -
5d16440053f78a56f6233b1a6849bb8a -
dacb9fb1ef2b564e9f6373a4c2a90219 -
ba40d6e7d6a32e03fabb61bb0d21843a -
5a5ee62d91266d9a02a37b59c3e2d581 -
95463c030b73c61d8d4f0e9c5be645de -
4bcd7d43849b65d98d9619df27c37679 -
92bc1f80d35596191d915af907f4d951 -
44f3cb8a0196ce37c323e8c6215c7771 -
real 1m0.046s <--- 1 minute
user 4m51.073s
sys 3m51.335s
It takes 1/4 of the time on my machine, but your mileage will vary... depending on your disk subsystem, your CPU etc.

Bad Disk performance after moving from Ubuntu to Centos 7

Relatively old Dell R620 server (32 cores / 128GB RAM) was working perfect for years with Ubuntu. Plain OS install, no Virtualization.
2 system disks in mirror (XFS)
6 RAID 5 disks for /var (XFS)
server is used for a nightly check of a MySQL Xtrabackup file.
Before the format and move to Centos 7 the process would finish by 08:00, Now running late at noon.
99% of the job is opening a large tar.gz file.
htop : there are only two processes doing something :
1. gzip -d : about 20% CPU
2. tar zxf Xtrabackup.tar.gz : about 4-7% CPU
iotop : it's steady at around 3M/s (Read) / 20-25 M/s (Write) which is about 25% of what i would expect at minimum.
Memory : Used : 1GB of 128GB
Server is fully updated both OS / HW / Firmware including the disks firmware.
IDRAC shows no problems.
Bottom line : Server is not working hard (to say the least) but performance is way off.
Any ideas would be appreciated.
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 2 0 469072 0 130362040 0 0 57 341 0 0 0 0 98 2 0
0 2 0 456916 0 130374568 0 0 3328 24576 1176 3241 2 1 94 4 0
You have blocked processes and also io operations (around 20MB/s). And this mean for me you have few processes which concurrently access disc resources. What you can do to improve the performance is instead of
tar zxf Xtrabackup.tar.gz
use
gzip -d Xtrabackup.tar.gz|tar xvf -
The second add parallelism and can benefit from multy processor, You can also benefit from increase of the pipe (fifo) buffer. Check this answer for some ideas
Also consider to tune filesystem where are stored output files of tar

Puppet agent hangs and eventually gives a memory allocation error

I'm using puppet as a provisioner for Vagrant, and am coming across an issue where Puppet will hang for an extremely long time when I do a "vagrant provision". Building the box from scratch using "vagrant up" doesn't seem to be a problem, only subsequent provisions.
If I turn puppet debug on and watch where it hangs, it seems to stop at various, seemingly arbitrary, points the first of which is:
Info: Applying configuration version '1401868442'
Debug: Prefetching yum resources for package
Debug: Executing '/bin/rpm --version'
Debug: Executing '/bin/rpm -qa --nosignature --nodigest --qf '%{NAME} %|EPOCH?{% {EPOCH}}:{0}| %{VERSION} %{RELEASE} %{ARCH}\n''
Executing this command on the server myself returns immediately.
Eventually, it gets past this and continues. Using the summary option, I get the following, after waiting for a very long time for it to complete:
Debug: Finishing transaction 70191217833880
Debug: Storing state
Debug: Stored state in 9.39 seconds
Notice: Finished catalog run in 1493.99 seconds
Changes:
Total: 2
Events:
Failure: 2
Success: 2
Total: 4
Resources:
Total: 18375
Changed: 2
Failed: 2
Skipped: 35
Out of sync: 4
Time:
User: 0.00
Anchor: 0.01
Schedule: 0.01
Yumrepo: 0.07
Augeas: 0.12
Package: 0.18
Exec: 0.96
Service: 1.07
Total: 108.93
Last run: 1401869964
Config retrieval: 16.49
Mongodb database: 3.99
File: 76.60
Mongodb user: 9.43
Version:
Config: 1401868442
Puppet: 3.4.3
This doesn't seem very helpful to me, as the amount of time total's 108 seconds, so where have the other 1385 seconds gone?
Throughout, Puppet seems to be hammering the box, using up a lot of CPU, but still doesn't seem to advance. The memory it uses seems to continually increase. When I kick off the command, top looks like this:
Cpu(s): 10.2%us, 2.2%sy, 0.0%ni, 85.5%id, 2.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4956928k total, 2849296k used, 2107632k free, 63464k buffers
Swap: 950264k total, 26688k used, 923576k free, 445692k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 439m 334m 3808 R 97.5 6.9 2:02.92 puppet
22 root 20 0 0 0 0 S 1.3 0.0 0:07.55 kblockd/0
18276 mongod 20 0 788m 31m 3040 S 1.3 0.6 2:31.82 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:13.15 java
20930 elastics 20 0 2340m 236m 6580 S 1.0 4.9 1:44.80 java
266 root 20 0 0 0 0 S 0.3 0.0 0:03.85 jbd2/dm-0-8
22717 vagrant 20 0 98.0m 2252 1276 S 0.3 0.0 0:01.81 sshd
28762 vagrant 20 0 15036 1228 932 R 0.3 0.0 0:00.10 top
1 root 20 0 19364 1180 964 S 0.0 0.0 0:00.86 init
To me, this seems fine, there's over 2GB of available memory and plenty of available swap. I have a max open files limit of 1024.
About 10-15 minutes later, still no advance in the console output, but top looks like this:
Cpu(s): 11.2%us, 1.6%sy, 0.0%ni, 86.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%s
Mem: 4956928k total, 3834376k used, 1122552k free, 64248k buffers
Swap: 950264k total, 24408k used, 925856k free, 445728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 1397m 1.3g 3808 R 99.6 26.7 15:16.19 puppet
18276 mongod 20 0 788m 31m 3040 R 1.7 0.6 2:45.03 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:25.93 java
20930 elastics 20 0 2340m 238m 6580 S 0.7 4.9 1:52.03 java
8486 root 20 0 308m 952 764 S 0.3 0.0 0:06.03 VBoxService
As you can see, puppet is now using a lot more of the memory, and it seems to continue in this fashion. The box it's building has 5GB of RAM, so I wouldn't have expected it to have memory issues. However, further down the line, after a long wait, I do get "Cannot allocate memory - fork(2)"
Running unlimit -a, I get:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 38566
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Which, again looks fine to me...
To be honest, I'm completely at a loss as to how to go about solving this, or what is causing it.
Any help or insight would be greatly appreciated!
EDIT:
So I managed to fix this eventually... It came down to using recurse with a file directive for a large directory. The target directory in question contained around 2GB worth of files, and puppet took a huge amount of time loading this into memory and doing it's hashes and comparisons. The first time I stood the server up, the directory was relatively empty so the check was quick, but then other resources were placed in it that increased its size massively, meaning subsequent runs took much longer.
The memory error that eventually was thrown was because, I can only assume, Puppet was loading the whole thing into memory in order to do its stuff...
I found a way around using the recurse function, and am now trying to avoid it like the plague...
Yeah, the problem with the recurse parameter on the file type is that it checks every single file's checksum, which on a massive directory adds up real quick.
As Felix suggests, using checksum => none is one way to fix it, another is to accomplish the task you're trying to do (say chmod or chown a whole directory) with an exec performing the native task, with an unless to check if it's already been done.
Something like:
define check_mode($mode) {
exec { "/bin/chmod $mode $name":
unless => "/bin/sh -c '[ $(/usr/bin/stat -c %a $name) == $mode ]'",
}
}
Taken from http://projects.puppetlabs.com/projects/1/wiki/File_Permission_Check_Patterns

Using the top program in bash to extract cpu time into a variable

I have an assignment in bash scripting trying to measure cpu time used for a process passed into the script by name. I can find the process id and pass it to the top program in bash. However, I haven't figured out how to extract the cpu time from the top program. for example:
top is printing out:
top - 00:57:07 up 6:06, 2 users, load average: 0.46, 0.31, 0.55
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.7 us, 0.8 sy, 0.0 ni, 94.6 id, 0.9 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 1928720 total, 1738072 used, 190648 free, 57184 buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3337 amarkovi 20 0 372m 31m 10m R 0.7 1.7 13:28.74 chromium-browse
all I want from this is the TIME+ field to be assigned to variable so I can add up the time and print it out by it self.
I am a noob to bash scripting so please be patient.
thanks,
Do you have to use top? It should be much simpler (once you work out the right options) to use ps to give you just the fields you want, then use grep to select just the processes you want.
Since it's an assigment i don't want to spoil all the fun :D. I'll just point you to some commands of which can help you in your endeavour : sed, awk and cut. With this 3 you can solve it in many ways, enjoy!

How to obtain the virtual private memory of a process from the command line under OSX?

I would like to obtain the virtual private memory consumed by a process under OSX from the command line. This is the value that Activity Monitor reports in the "Virtual Mem" column. ps -o vsz reports the total address space available to the process and is therefore not useful.
You can obtain the virtual private memory use of a single process by running
top -l 1 -s 0 -i 1 -stats vprvt -pid PID
where PID is the process ID of the process you are interested in. This results in about a dozen lines of output ending with
VPRVT
55M+
So by parsing the last line of output, one can at least obtain the memory footprint in MB. I tested this on OSX 10.6.8.
update
I realized (after I got downvoted) that #user1389686 gave an answer in the comment section of the OP that was better than my paltry first attempt. What follows is based on user1389686's own answer. I cannot take credit for it -- I've just cleaned it up a bit.
original, edited with -stats vprvt
As Mahmoud Al-Qudsi mentioned, top does what you want. If PID 8631 is the process you want to examine:
$ top -l 1 -s 0 -stats vprvt -pid 8631
Processes: 84 total, 2 running, 82 sleeping, 378 threads
2012/07/14 02:42:05
Load Avg: 0.34, 0.15, 0.04
CPU usage: 15.38% user, 30.76% sys, 53.84% idle
SharedLibs: 4668K resident, 4220K data, 0B linkedit.
MemRegions: 15160 total, 961M resident, 25M private, 520M shared.
PhysMem: 917M wired, 1207M active, 276M inactive, 2400M used, 5790M free.
VM: 171G vsize, 1039M framework vsize, 1523860(0) pageins, 811163(0) pageouts.
Networks: packets: 431147/140M in, 261381/59M out.
Disks: 487900/8547M read, 2784975/40G written.
VPRVT
8631
Here's how I get at this value using a bit of Ruby code:
# Return the virtual memory size of the current process
def virtual_private_memory
s = `top -l 1 -s 0 -stats vprvt -pid #{Process.pid}`.split($/).last
return nil unless s =~ /\A(\d*)([KMG])/
$1.to_i * case $2
when "K"
1000
when "M"
1000000
when "G"
1000000000
else
raise ArgumentError.new("unrecognized multiplier in #{f}")
end
end
Updated answer, thats work under Yosemite, from user1389686:
top -l 1 -s 0 -stats mem -pid PID

Resources