Calculate md5 on a single 1T file or on 100 10G files: which one is faster, or is the speed the same? - performance

I have a huge 1T file on my local machine and one on a remote server, and I need to calculate their md5 sums to check that they are exactly the same. Since calculating the md5 will take a long time, I want to do some research on md5 speed first. I can calculate md5 directly against the whole file, or split it into 100 10G files and calculate md5 on each of them. Which one is faster, or will they take the same time?

As I was trying to say in the comments, it will depend on lots of things like the speed of your disk subsystem, your CPU performance and so on.
Here is an example. Create a 120GB file and check its size:
dd if=/dev/random of=junk bs=1g count=120
ls -lh junk
-rw-r--r-- 1 mark staff 120G 5 Oct 13:34 junk
Checksum in one go:
time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0 junk
real 3m55.713s <--- 4 minutes
user 3m28.441s
sys 0m24.871s
Checksum in 10GB chunks, with 12 CPU cores in parallel:
time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d -
793f02bb52407415b2bfb752827e3845 -
bf8f724d63f972251c2973c5bc73b68f -
d227dcb00f981012527fdfe12b0a9e0e -
5d16440053f78a56f6233b1a6849bb8a -
dacb9fb1ef2b564e9f6373a4c2a90219 -
ba40d6e7d6a32e03fabb61bb0d21843a -
5a5ee62d91266d9a02a37b59c3e2d581 -
95463c030b73c61d8d4f0e9c5be645de -
4bcd7d43849b65d98d9619df27c37679 -
92bc1f80d35596191d915af907f4d951 -
44f3cb8a0196ce37c323e8c6215c7771 -
real 1m0.046s <--- 1 minute
user 4m51.073s
sys 3m51.335s
The chunked version takes a quarter of the time on my machine, but your mileage will vary depending on your disk subsystem, your CPU, and so on.
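If the goal is to verify that the local and remote copies are identical, the same chunked approach can be run on both sides and the per-chunk checksums compared, so a mismatch shows up without waiting for a whole-file hash. A rough sketch, assuming GNU parallel is available on both machines; the host name and file paths are placeholders:
# ordered (-k) chunk checksums for the local copy
parallel -k --pipepart --recend '' --recstart '' --block 10G -a bigfile md5sum > local.md5
# the same command for the remote copy, run over ssh
ssh remotehost "parallel -k --pipepart --recend '' --recstart '' --block 10G -a /path/to/bigfile md5sum" > remote.md5
# any differing chunk shows up in the diff
diff local.md5 remote.md5 && echo "files match"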

Related

samtools calmd is pretty slow

I am using "samtools calmd" to add MD tag back to BAM file. The size of original BAM is around 50Gb (whole genome sequence by using pacbio HIFI reads). The issue that I encountered is that the speed of "calmd" is incredibly slow! The jobs have already run 12 hours, and only 600MB BAM with MD tag are generated. In this way, 50GB BAM will take 30days to be finished!
Here is the code I used to add the MD tag (nothing unusual):
rule addMDTag:
    input:
        rules.pbmm2_alignment.output
    output:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"
    params:
        ref = strRef
    threads:
        16
    log:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
    benchmark:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
    shell:
        "samtools calmd -@ {threads} {input} {params.ref} -bAr > {output}"
The version of samtools I used is v1.10.
BTW, I gave calmd 16 cores, but it looks like samtools is still only using 1 core:
top - 11:44:53 up 47 days, 20:35, 1 user, load average: 2.00, 2.01, 2.00
Tasks: 1723 total, 3 running, 1720 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.3%sy, 0.0%ni, 96.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529329180k total, 232414724k used, 296914456k free, 84016k buffers
Swap: 12582908k total, 74884k used, 12508024k free, 227912476k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
93137 lix33 20 0 954m 151m 2180 R 100.2 0.0 659:04.13 samtools
How can I make calmd much faster? Or is there another tool that can do the same job more efficiently?
Thanks so much.
After working with the samtools maintenance team, this issue has been solved.
calmd will be extremely slow if the BAM is unsorted, so always make sure the BAM has been position-sorted before running calmd.
See the details below:
Are your files name sorted, and does your reference have more than one entry?
If so, calmd will be switching between references all the time,
which means it may be doing a lot of reference loading and not much MD calculation.
You may find it goes a lot faster if you position-sort the input, and then run it through calmd.
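A minimal sketch of that fix, assuming an unsorted input.bam and a reference ref.fa (both file names are illustrative):
samtools sort -@ 16 -o input.sorted.bam input.bam                    # coordinate-sort first
samtools calmd -@ 16 -bAr input.sorted.bam ref.fa > input.MD.bam     # then add MD tags to the sorted BAM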

Large result is slow anywhere but local

I have a fairly large query running on ClickHouse. When running on localhost using the command line it takes about 0.7 sec to complete, consistently fast. The issue is when querying from C# / HTTP / Postman: there it takes about 10 times as long to return the data. The result is about 3-4 MB, so I don't think it's a size issue.
I have tried monitoring network latency, but there is nothing to notice there.
On the host it works like a charm, but from outside it does not :(... what to do?
I expect the latency to be a few hundred ms, but it turns out to be 7 sec :/
Check timings with curl (see https://clickhouse.yandex/docs/en/interfaces/http/ and https://stackoverflow.com/a/22625150) and compare local vs remote.
CH HTTP usually provides almost the same performance as TCP, and HTTP can even be faster for small result sets (like 10 rows).
Again. The problem is not the HTTP.
Example:
time clickhouse-client -q "select number, arrayMap(x->sipHash64(number,x), range(10)) from numbers(10000)" >native.out
real 0m0.034s
time curl -S -o http.out 'http://localhost:8123/?query=select%20number%2C%20arrayMap(x-%3EsipHash64(number%2Cx)%2C%20range(10))%20from%20numbers(10000)'
real 0m0.017s
ls -l http.out native.out
2108707 Oct 1 16:17 http.out
2108707 Oct 1 16:17 native.out
10,000 rows - about 2 MB
HTTP is faster: 0.017s vs 0.034s
Canada -> Germany (openvpn)
time curl -S -o http.out 'http://user:xxx@cl.host.x:8123/?query=select%20number%2C%20arrayMap(x-%3EsipHash64(number%2Cx)%2C%20range(10))%20from%20numbers(10000)'
real 0m1.619s
ping cl.host.x
PING cl.host.x (10.253.52.6): 56 data bytes
64 bytes from 10.253.52.6: icmp_seq=0 ttl=61 time=131.710 ms
64 bytes from 10.253.52.6: icmp_seq=1 ttl=61 time=133.711 ms
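To see where the extra seconds go on the remote path, the per-phase timings suggested in the comment above can be printed with curl's -w option. A minimal sketch (the URL and query are placeholders):
curl -o /dev/null -s \
  -w 'namelookup: %{time_namelookup}\nconnect: %{time_connect}\nstarttransfer: %{time_starttransfer}\ntotal: %{time_total}\n' \
  'http://cl.host.x:8123/?query=SELECT%201'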

cwRsync, network drive, file modification time issue

I'm using cwRsync 5.4.1 x86 Free under Windows and trying to sync a folder to a network drive.
I execute the following command:
rsync.exe -rLtv --delete --ignore-errors "/cygdrive/d/1/" "/cygdrive/z/ZipNB/"
where D is a local drive and Z is a network drive (an external HDD connected to an RT-N16 router).
Executing it several times gives the same result:
>rsync.exe -rLtv --delete --ignore-errors "/cygdrive/d/1/" "/cygdrive/z/ZipNB/"
sending incremental file list
./
1.pdf
sent 11,893,922 bytes received 38 bytes 1,829,840.00 bytes/sec
total size is 11,890,918 speedup is 1.00
I have one file in the folder, and rsync sends its full content on each execution. The file is the same each time and was not changed in between.
If I add the additional parameter --size-only, it works as expected:
>rsync.exe -rLtv --delete --ignore-errors --size-only "/cygdrive/d/1/" "/cygdrive/z/ZipNB/"
sending incremental file list
./
sent 72 bytes received 22 bytes 188.00 bytes/sec
total size is 11,890,918 speedup is 126,499.13
DIR for both directories:
D:\1>dir
Volume in drive D is XXX
Volume Serial Number is XXXX-XXX
Directory of D:\1
08.12.2016 10:04 <DIR> .
08.12.2016 10:04 <DIR> ..
24.11.2016 18:31 11 890 918 1.pdf
1 File(s) 11 890 918 bytes
Z:\ZipNB>dir
Volume in drive Z is BackUp (at Portable)
Volume Serial Number is XXXX-XXX
Directory of Z:\ZipNB
08.12.2016 10:04 <DIR> .
08.10.2016 20:40 <DIR> ..
24.11.2016 18:31 11 890 918 1.pdf
1 File(s) 11 890 918 bytes
As far as I'm aware, rsync by default decides whether a file needs transferring by comparing modification time and size. Both files seem identical, but it looks like cwRsync for some reason gets/sets the wrong modification date on the file on the Z drive. cwRsync works correctly if both directories are on a local drive; it happens only with the network drive.
In the Windows file properties there is a 1-second difference in the modification times, which may be causing the problem.
I took only one file as an example to simplify the output; the situation is the same with any number of different files. It always sends the full content of each file.
What could be wrong here, and how can I fix it?
I'm guessing the HDD on the network share uses FAT, because from File Times:
For example, the resolution of create time on FAT is 10 milliseconds,
while write time has a resolution of 2 seconds and access time has a
resolution of 1 day, so it is really the access date.
That would explain the time difference.
And for this kind of reason rsync added the --modify-window option:
-@, --modify-window
When comparing two timestamps, rsync treats the timestamps as being equal if they differ by no more than the modify-window value.
The default is 0, which matches just integer seconds. If you specify a
negative value (and the receiver is at least version 3.1.3) then
nanoseconds will also be taken into account. Specifying 1 is useful
for copies to/from MS Windows FAT filesystems, because FAT represents
times with a 2-second resolution (allowing times to differ from the
original by up to 1 second).
So try adding -@1 (i.e. --modify-window=1) to your command.
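A sketch of the original command with that 1-second tolerance added (paths exactly as in the question):
rsync.exe -rLtv --delete --ignore-errors --modify-window=1 "/cygdrive/d/1/" "/cygdrive/z/ZipNB/"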

Filter int, float and character from a text file in a Shell Script

Suppose I have a text file, which contains data like this.
The output below was generated by du -sh /home/*:
1.5G user1
2.5G user2
And so on...
Now I want those sizes to be stored in an array and compared against 5 GB, so I can tell whether a user is consuming more than 5 GB. What can I do?
The du command shows the usage of each folder in the home directory, and I want to be notified when some user is consuming more than 5 GB. Because there is a long list of users, it would be tedious to check each user's disk usage by hand. I want a shell script that identifies the usage of each directory under /home; I will then add a mail function to notify myself when the limit is exceeded.
Note: I don't want to implement quotas, as I just want to monitor the usage.
Use du's -t (--threshold) option to specify you only want to know about directories with more than a certain amount of data in them:
$ du -sh -t 5G /home/*
If you're picky about precisely how big a gigabyte is, note that 5G uses multiples of 1024; you may prefer -t 5GB for multiples of 1000, or even -t 5000M to mix them.
For lots of users, you're probably better off writing that using -d 1 instead of -s to avoid the shell having to expand the * into a very long list:
$ du -h -d 1 -t 5G /home/
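To turn that into the notification the question asks for, the filtered output can be piped into a loop that mails each offending directory. A rough sketch, assuming a working mail command is configured; the recipient address is a placeholder:
du -h -d 1 -t 5G /home/ | while read -r size dir; do
    # each du line is "<size> <directory>"; read splits it into the two fields
    echo "$dir is using $size" | mail -s "Disk usage alert: $dir" admin@example.com
done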

How to zgrep the last line of a gz file without tail

Here is my problem: I have a set of big gz log files, and the very first information in each line is a datetime text, e.g. 2014-03-20 05:32:00.
I need to check which set of log files holds specific data.
For the initial record I simply do:
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
BUT HOW to do the same with the last line, without processing the whole file as zcat would (too heavy):
zcat foo.gz | tail -1
Additional info: those logs are created with the datetime of their initial record, so if I want to query logs at 14:00:00 I also have to search files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which allow a program that is aware of that extra information to seek within the file. While this exists in the standard, vanilla gzip does not add such markers, either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives, in which case each archive has its own block. GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies with the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
             ————— No Random Access ——————    ——————— Random Access ———————
FORMAT        SIZE    RATIO   WRITE   READ     SIZE    RATIO   WRITE   SEEK
—————————    —————————————————————————————    —————————————————————————————
(original)   7211M   1.0000       -   0:06    7211M   1.0000       -   0:00
bzip2          96M   0.0133   48:31   3:15      97M   0.0134   47:39   0:00
gzip           79M   0.0109    0:59   0:22
dictzip                                         605M   0.0839    1:36 (fail)
xz -0          25M   0.0034    1:14   0:12      25M   0.0035    1:08   0:00
xz             14M   0.0019   16:32   0:11      14M   0.0020   16:44   0:00
Timing tests were not comprehensive; I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can randomly access a gzipped file if you have previously created an index for each file ...
I've developed a command-line tool that creates indexes for gzip files, which allow very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S supervises a still-growing file and creates an index for it as it grows - this can be useful for gzipped rsyslog files, as it reduces the index-creation time to practically zero.
-t tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please note that if the index doesn't exist yet, this will take the same time as a complete decompression; but since the index is reusable, subsequent searches will be greatly reduced in time!
This tool is based on the zran.c demonstration code from the original zlib, so there's no out-of-the-rules magic!
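A hedged usage sketch based on the two options described above (foo.gz stands in for a growing log; the exact invocation may differ between gztool versions):
gztool -S foo.gz &            # supervise the still-growing file and keep its index up to date
gztool -t foo.gz | tail -1    # tail the gzip via the index and take the last line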
