bash script with parallel execution

I am trying to use GNU Parallel in a bash script to verify whether several S3 paths exist, by counting the objects under each path. If the object count for a path is zero, the script should move on to the next date in the for loop, but with parallel it is not working as expected.
For the date range given in the for loop we don't actually have those folders in the S3 bucket, and in the checkS3Path function I create a 0 KB file when the path doesn't exist, but I don't see those 0 KB files being created after the script runs. Instead of "S3 Path Doesnt Exists folder1:+2019-10-03", the script prints "S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-03". Please see the output below.
Please let me know what the issue might be.
Here is the sample code.
#!/bin/bash
#set -x
s3Bucket=testbucket
version=v20
Array=(folder1 folder2 folder3)

checkS3Path() {
  fldName=$1
  date=$2
  objectNum=$(aws s3 ls s3://${s3Bucket}/${version}/${fldName}/date=${date}/ | wc -l)
  echo $objectNum
  if [ "$objectNum" -eq 0 ]
  then
    echo "S3 Path Doesnt Exists ${fldName}:${date}" >> /app/${fldName}.log
    touch /home/ubuntu/${fldName}_${date}.txt
    continue
  else
    echo "S3 Path Consists csv Files, Proceeding to next step ${fldName}:${date}"
  fi
}

final() {
  fldName=$1
  date=$2
  checkS3Path $fldName $date
  function2 $fldName $date
  function3 $fldName $date
}
export -f final checkS3Path

for date in 2019-10-{01..03}
do
  # finalstep folder1 $date
  parallel --jobs 4 --eta finalstep ::: "${Array[#]}" ::: +"$date"
done
Here is the output I am seeing.
$ ./test.sh
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
To silence this citation notice: run 'parallel --citation'.
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-01
ETA: 0s Left: 13 AVG: 0.00s local:4/1/100%/2.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-01
ETA: 0s Left: 12 AVG: 0.00s local:4/2/100%/1.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-01
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-02
ETA: 0s Left: 13 AVG: 0.00s local:4/1/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-02
ETA: 6s Left: 12 AVG: 0.50s local:4/2/100%/0.5s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-02
ETA: 3s Left: 11 AVG: 0.33s local:4/3/100%/0.3s 202
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-03
ETA: 0s Left: 13 AVG: 0.00s local:4/1/100%/1.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-03
ETA: 0s Left: 12 AVG: 0.00s local:4/2/100%/0.5s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-03
ETA: 0s Left: 11 AVG: 0.00s local:4/3/100%/0.3s 202
$
Thanks

If checkS3Path works when run by hand, then you probably just need to:
export s3Bucket=testbucket
export version=v20
Each GNU Parallel job runs in its own shell (started from Perl), which is why you need to export variables if you want them to be visible to the job.
Also look at env_parallel to do this automatically.
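For example, a minimal sketch of how the exports fit into the sample (the rest of the script stays as it is; I've written the function name as final and the array expansion as "${Array[@]}", which is what the sample appears to intend, and function2/function3 are assumed to be defined earlier in the script):
export s3Bucket=testbucket
export version=v20
# export -f makes the shell functions themselves visible in the child shells parallel starts
export -f final checkS3Path function2 function3

for date in 2019-10-{01..03}
do
  parallel --jobs 4 --eta final ::: "${Array[@]}" ::: "$date"
done
With env_parallel in place of parallel (after the one-time per-shell setup described in man env_parallel), the current variables, arrays and functions are copied into the jobs automatically, so the explicit exports are not needed.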

Related

jupyter notebook %%time doesn't measure cpu time of %%sh commands?

When I run Python code in a JupyterLab (v3.4.3) IPython notebook (v8.4.0) and use the %%time cell magic, both CPU time and wall time are reported.
%%time
for i in range(10000000):
a = i*i
CPU times: user 758 ms, sys: 121 µs, total: 758 ms
Wall time: 757 ms
But when the same computation is performed using the %%sh magic to run a shell script, the CPU time results are nonsense.
%%time
%%sh
python -c "for i in range(10000000): a = i*i"
CPU times: user 6.14 ms, sys: 12.5 ms, total: 18.6 ms
Wall time: 920 ms
The docs for %time do say "Time execution of a Python statement or expression.", but this still surprised me because I had assumed that the shell script would run in a Python subprocess and thus could also be measured. So, what's going on here? Is this a bug, or just a known caveat of using %%sh?
I know I can use the shell builtin time or /usr/bin/time to get similar output, but this is a bit cumbersome for multiple lines of shell. Is there a better workaround?
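For reference, this is the sort of /usr/bin/time workaround the question alludes to, run inside the %%sh cell itself so the child process's own CPU time is what gets reported (a sketch; -p requests the portable real/user/sys format):
%%sh
# measure inside the shell cell rather than with the %%time magic
/usr/bin/time -p python -c "for i in range(10000000): a = i*i"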

samtools calmd is pretty slow

I am using "samtools calmd" to add the MD tag back to a BAM file. The original BAM is around 50 GB (whole-genome sequencing with PacBio HiFi reads). The issue is that calmd is incredibly slow: the job has already run for 12 hours and only 600 MB of MD-tagged BAM has been generated. At this rate the 50 GB BAM will take 30 days to finish!
Here is the (quite ordinary) Snakemake rule I use to add the MD tag:
rule addMDTag:
    input:
        rules.pbmm2_alignment.output
    output:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"
    params:
        ref = strRef
    threads:
        16
    log:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
    benchmark:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
    shell:
        "samtools calmd -# {threads} {input} {params.ref} -bAr > {output}"
The version of samtools I used is v1.10.
BTW, I ask for 16 cores to run calmd; however, it looks like samtools is still only using 1 core:
top - 11:44:53 up 47 days, 20:35, 1 user, load average: 2.00, 2.01, 2.00
Tasks: 1723 total, 3 running, 1720 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.3%sy, 0.0%ni, 96.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529329180k total, 232414724k used, 296914456k free, 84016k buffers
Swap: 12582908k total, 74884k used, 12508024k free, 227912476k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
93137 lix33 20 0 954m 151m 2180 R 100.2 0.0 659:04.13 samtools
How can I make calmd much faster? Or is there another tool that can do the same job more efficiently?
Thanks so much
After collaborating with the samtools maintainers, this issue has been solved.
calmd will be very slow if the BAM is unsorted, so always make sure the BAM is position-sorted before running calmd.
See their explanation below:
Are your files name sorted, and does your reference have more than one entry?
If so calmd will be switching between references all the time,
which means it may be doing a lot of reference loading and not much MD calculation.
You may find it goes a lot faster if you position-sort the input, and then run it through calmd.
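In other words, coordinate-sort first and then run calmd. A minimal sketch outside of Snakemake, with placeholder file names rather than the paths from the rule above:
# position-sort the alignments, then recalculate MD/NM against the reference
samtools sort -@ 16 -o rawReads.sorted.bam rawReads.bam
samtools calmd -bAr rawReads.sorted.bam ref.fa > rawReads.MD.bam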

Scripting a clamscan summary that adds multiple "Infected files" outputs together

I want a simple way to add 2 numbers taken from a text file. Details below:
Daily, I run clamscan against my /home/ folder, which generates a simple log along the lines of this:
Scanning 851M in /home/.
----------- SCAN SUMMARY -----------
Infected files: 0
Time: 0.000 sec (0 m 0 s)
Start Date: 2021:11:27 06:25:02
End Date: 2021:11:27 06:25:02
Weekly, I scan both my /home/ folder and an external drive, so I get twice as much in the log:
Scanning 851M in /home/.
----------- SCAN SUMMARY -----------
Infected files: 0
Time: 0.000 sec (0 m 0 s)
Start Date: 2021:11:28 06:25:02
End Date: 2021:11:28 06:25:02
Scanning 2.8T in /mnt/ext/.
----------- SCAN SUMMARY -----------
Infected files: 0
Time: 0.005 sec (0 m 0 s)
Start Date: 2021:11:28 06:26:30
End Date: 2021:11:28 06:26:30
I don't email the log to myself; I just have a bash script that (for the daily scan) reads the number after "Infected files:" and sends an email saying either "No infected files found" or "Infected files found, check log." (And, to be honest, once I'm 100% comfortable that it all works the way I want, I'll skip the "No infected files found" email.) The problem is that I don't know how to make that work for the weekly scan of multiple folders, because the summary doesn't combine those numbers.
I'd like the script to find both lines that start with "Infected files:", get the numbers that follow, and add them. I guess the ideal solution would use a loop, in case I ever need to scan more than two folders. I've taken a couple of stabs at it with grep and cut, but I'm just not experienced enough a coder to make it all work.
Thanks!
This bash script will print out the sum of infected files:
#!/bin/bash
# collect every number that follows "Infected files:" (one per scan summary)
n=$(sed -n 's/^Infected files://p' logfile)
# join the numbers with "+" and let arithmetic expansion add them up
echo $((${n//$'\n'/+}))
or a one-liner:
echo $(( $(sed -n 's/^Infected files: \(.*\)/+\1/p' logfile) ))
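If it helps, here is a sketch of how that sum could drive the email decision described in the question (logfile, the mail command and the address are placeholders for whatever the existing daily script already uses):
#!/bin/bash
# sum every "Infected files:" count, whether the log holds one scan summary or several
total=$(( $(sed -n 's/^Infected files: \(.*\)/+\1/p' logfile) ))
if [ "$total" -gt 0 ]; then
  echo "Infected files found ($total), check log." | mail -s "clamscan: infected files" you@example.com
else
  echo "No infected files found." | mail -s "clamscan: clean" you@example.com
fi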

Calculate md5 on a single 1T file, or on 100 10G files: which one is faster, or is the speed the same?

I have a huge 1T file on my local machine and one on a remote server, and I need to calculate their md5 sums to check that they are exactly the same. Since that will take a long time, I want to do some research on md5 speed first. I can calculate the md5 directly against the whole file, or split it into 100 10G files and calculate the md5 of each. I want to know which one is faster, or will they take the same time?
As I was trying to say in the comments, it will depend on lots of things like the speed of your disk subsystem, your CPU performance and so on.
Here is an example. Create a 120GB file and check its size:
dd if=/dev/random of=junk bs=1g count=120
ls -lh junk
-rw-r--r-- 1 mark staff 120G 5 Oct 13:34 junk
Checksum in one go:
time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0 junk
real 3m55.713s <--- 4 minutes
user 3m28.441s
sys 0m24.871s
Checksum in 10GB chunks, with 12 CPU cores in parallel:
time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d -
793f02bb52407415b2bfb752827e3845 -
bf8f724d63f972251c2973c5bc73b68f -
d227dcb00f981012527fdfe12b0a9e0e -
5d16440053f78a56f6233b1a6849bb8a -
dacb9fb1ef2b564e9f6373a4c2a90219 -
ba40d6e7d6a32e03fabb61bb0d21843a -
5a5ee62d91266d9a02a37b59c3e2d581 -
95463c030b73c61d8d4f0e9c5be645de -
4bcd7d43849b65d98d9619df27c37679 -
92bc1f80d35596191d915af907f4d951 -
44f3cb8a0196ce37c323e8c6215c7771 -
real 1m0.046s <--- 1 minute
user 4m51.073s
sys 3m51.335s
It takes 1/4 of the time on my machine, but your mileage will vary... depending on your disk subsystem, your CPU etc.
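Since the original goal is comparing a local and a remote copy, the per-chunk sums can also be compared directly, provided both sides use the same block size so the chunk boundaries line up (a sketch; the host and file names are placeholders):
parallel -k --pipepart --recend '' --recstart '' --block 10G -a bigfile md5sum > local.md5
ssh remotehost "parallel -k --pipepart --recend '' --recstart '' --block 10G -a bigfile md5sum" > remote.md5
diff local.md5 remote.md5 && echo "files match"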

Enabling tempcomp in Chrony

I'm working on a Raspberry Pi running Buster with an Adafruit Ultimate GPS HAT. I'm trying to get chrony to apply temperature compensation. I've modified the chrony.conf file to contain:
# Uncomment the following line to turn logging on.
log measurements refclocks statistics tempcomp tracking
tempcomp /sys/class/hwmon/hwmon0/temp1_input 30 26000 0.0 0.000183 0.0
#tempcomp /sys/class/hwmon/hwmon0/temp1_input 30 /var/log/chrony/tempcomp.log
The system is currently adding measurements to the tempcomp.log file every 30 seconds. However, if I enable the second (commented-out) tempcomp line above, chrony dies on restart with this error:
Sep 6 12:31:45 rpi-tick2 chronyd[24713]: chronyd version 3.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 -DEBUG)
Sep 6 12:31:45 rpi-tick2 chronyd[24713]: Frequency 23.662 +/- 0.165 ppm read from /var/lib/chrony/chrony.drift
Sep 6 12:31:45 rpi-tick2 chronyd[24713]: Fatal error : Could not read tempcomp point from /var/log/chrony/tempcomp.log
Sep 6 12:31:45 rpi-tick2 chronyd[24711]: Could not read tempcomp point from /var/log/chrony/tempcomp.log
I believe this is because the tempcomp.log file has entries like
===========================================
Date (UTC) Time Temp. Comp.
===========================================
2021-09-06 17:40:47 5.2095e+04 4.7754e+00
2021-09-06 17:41:17 5.2582e+04 4.8645e+00
2021-09-06 17:41:47 5.2582e+04 4.8645e+00
2021-09-06 17:42:17 5.3069e+04 4.9536e+00
2021-09-06 17:42:47 5.2582e+04 4.8645e+00
Where chrony is expecting something like
20000 1.0
21000 0.64
22000 0.36
23000 0.16
24000 0.04
and sorted by temperature, not by sample time.
So it seems like I'm missing a step somewhere.
Also, once set up, is this a dynamic process where new datapoints are added as we go, or do we stop collecting data and just use the static table to compensate for temps?
Thanks for any insights.
I think you have to manually create a chrony.tempcomp file, likely by analyzing the tempcomp.log file. They are separate files. Then specify the chrony.tempcomp file like this:
tempcomp /sys/class/hwmon/hwmon0/temp2_input 30 /etc/chrony.tempcomp
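A sketch of one way to turn the accumulated log into that point file, assuming the Temp./Comp. columns shown above are the values you want and that duplicate temperatures can be collapsed to a single entry (chrony expects the points sorted by temperature):
# keep only the data lines, print temperature and compensation in plain notation,
# then sort numerically by temperature and drop duplicate temperatures
awk 'NF == 4 && $3 ~ /e[+-]/ { printf "%.0f %.4f\n", $3, $4 }' /var/log/chrony/tempcomp.log \
  | sort -n -u > /etc/chrony.tempcomp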
