Parallel processing or multi-threading in Shell scripting - bash

I am writing a script in shell to pull files from application using curl call, and to pull 100 files it is taking more than 30 minutes.
I want split this into multiple chunks and do a parallel curl call (eg: 10 files each call).
I am new in parallel processing/threading.

Q : "I want ... do a parallel curl call ..."
For all possible syntax details, start with man parallel
Next, one may use also some curl-tricks, yet only those that do not collide with the parallel syntax-elements.
As the fileIO-ops are both slow and bear rather a high ( yet maskable ) latency, a number of concurrent processes might grow well high :
parallel --jobs 24 \
--dry-run \
curl \
ftps://a.b.c.d/node7-{1}/{1}-{2}/{1}-{2}-{3}-{4}_*.jpg \
::: "LKLN" "LKRO" "LKPM" \
::: $( seq -f "%04g" 2011 2020 ) \
::: $( seq -f "%02g" 4 6 ) \
::: $( seq -f "%02g" 1 31 ) \
will yield a demo of :
...
curl ftps://a.b.c.d/node7-LKLN/LKLN-2020/LKLN-2020-06-29_*.jpg
curl ftps://a.b.c.d/node7-LKLN/LKLN-2020/LKLN-2020-06-30_*.jpg
curl ftps://a.b.c.d/node7-LKLN/LKLN-2020/LKLN-2020-06-31_*.jpg
curl ftps://a.b.c.d/node7-LKRO/LKRO-2011/LKRO-2011-04-01_*.jpg
curl ftps://a.b.c.d/node7-LKRO/LKRO-2011/LKRO-2011-04-02_*.jpg
curl ftps://a.b.c.d/node7-LKRO/LKRO-2011/LKRO-2011-04-03_*.jpg
...
all split among the said 24 parallel-orchestrated processes
Finally feel free to adapt your scripting strategy so as to meet and match your actual transport, storage, processor and memory capacities, logging and self-reporting needs.
And you became a next master of the parallel-orchestrated processing.
All credits, since 2007, go to Ole Tange!

Related

Limit Speed in Ubuntu based on traffic

I have come across a script i.e. Wondershaper
The script is terrific, however any way to make it smarter?
Like it runs after certain traffic has gone through?
Say 1TB is set per day, once 1TB is hit, the script turns on automatically?
I have thought about setting crn job,
At 12 am it clears the wondershaper, and in 15mins interval, it checks if the server has crossed 1TB limit for the day, and then if it is true then it runs the limiter,
but I am not sure how to set up the 2nd part, how can i setup a way that will enable the limiter to run after 1TB is crossed?
Remove Code
wondershaper -ca eth0
Limit Code
wondershaper -a eth0 -u 154000
I have made a custom script for this, as it is not possible to do it within the system, i had to become creative and do a API call to the datacenter and then run cron job.
I also used bashjson, to run it. I have attached the script below.
date=$(date +%F)
url='API URL /metrics/datatraffic?from='
url1='T00:00:00Z&to='
url2='T23:59:59Z&aggregation=SUM'
final="$url$date$url1$date$url2"
wget --no-check-certificate -O output.txt \
--method GET \
--timeout=0 \
--header 'X-Lsw-Auth: API AUTH' \
$final
sed 's/[][]//g' output.txt >> test1.json // will remove '[]' from the code just to make things easier for bashjson to understand
down=$(/root/bashjson/bashjson.sh test1.json metrics DOWN_PUBLIC values value) // outputs the data to variable
up=$(/root/bashjson/bashjson.sh test1.json metrics UP_PUBLIC values value)
newdown=$(printf "%.14f" $down)
newup=$(printf "%.14f" $up)
upp=$(printf "%.0f\n" "$newup") // removes scientific notation as bash does not like it
downn=$(printf "%.0f\n" "$newdown")
if (($upp>800000000000 | bc))
then
wondershaper -a eth0 -u 100000 //main command to limit
else
echo uppworks
fi
if (($downn>500000000000 | bc))
then
wondershaper -a eth0 -d 100000
else
echo downworks
fi
rm -rf output.txt test1.json
echo $upp
echo $downn
You can always update it as per your preference.

How to use parallel with curl?

How do i use gnu parallel to make this process faster ?
#!/bin/bash
for (( c=1; c<=100; c++ ))
do
curl -sS 'https://example.com' \
--data 'value='$c'' /dev/null
echo $c
done
You can use parallel, or xargs
seq 100 | parallel curl -sS 'https://example.com' --data value='{}' /dev/null
seq 100 | xargs -I{} curl -sS 'https://example.com' --data value='{}' /dev/null
As the script stand, output will be sent to stdout. With xargs, this will result in output from different calls potentially mixed. Consider redirect output to files for additional processing, if needed.
You can add options for max parallel (-Pn, etc.) as needed
I'm not sure why '/dev/null' is needed. Consider reordering:
curl -sS --data value='{}' https://example.com'

Run million of list in PBS with parallel tool

I've huge size(few million) job contain list and wants to run java written tool to perform the features comparison. This tool completes the calculation in
real 0m0.179s
user 0m0.005s
sys 0m0.000s sec
Running 5 nodes(each have 72 cpus) with pbs torque scheduler in the GNU parallel, tool runs fine and produces the results but as I set 72 jobs per node, it should run 72 x 5 jobs at a time but I can see only it runs 25-35 jobs!
Checking of cpu utilization on each node also shows low utilization.
I desire to run 72 X 5 jobs or more at a time and produce the results by utilizing all the available source (72 X 5 cpus).
As I mentioned have ~200 millions of job to run, I desire to complete it faster(1-2 hours) by using/increasing the number of nodes/cpus.
Current code, input and job state:
example.lst (it has ~300 million lines)
ZNF512-xxxx_2_N-THRA-xxtx_2_N
ZNF512-xxxx_2_N-THRA-xxtx_3_N
ZNF512-xxxx_2_N-THRA-xxtx_4_N
.......
cat job_script.sh
#!/bin/bash
#PBS -l nodes=5:ppn=72
#PBS -N job01
#PBS -j oe
#work dir
export WDIR=/shared/data/work_dir
cd $WDIR;
# use available 72 cpu in each node
export JOBS_PER_NODE=72
#gnu parallel command
parallelrun="parallel -j $JOBS_PER_NODE --slf $PBS_NODEFILE --wd $WDIR --joblog process.log --resume"
$parallelrun -a example.lst sh run_script.sh {}
cat run_script.sh
#!/bin/bash
# parallel command options
i=$1
data=/shared/TF_data
# create tmp dir and work in
TMP_DIR=/shared/data/work_dir/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# get file name
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
#run a tool to compare the features of pair files
/shared/software/tool_v2.1/tool -s1 $data/inf_tf/$mk -s1cf $data/features/$mk-cf -s1ss $data/features/$mk-ss -s2 $data/inf_tf/$nk.pdb -s2cf $data/features/$nk-cf.pdb -s2ss $data/features/$nk-ss.pdb > $data/$i.out
# move output files
mv matrix.txt $data/glosa_tf/matrix/$mk"_"$nk.txt
mv ali_struct.pdb $data/glosa_tf/aligned/$nk"_"$mk.pdb
# move back and remove tmp dir
cd $TMP_DIR/../
rm -rf $TMP_DIR
exit 0
PBS submission
qsub job_script.sh
Login to one of the node : ssh ip-172-31-9-208
top - 09:28:03 up 15 min, 1 user, load average: 14.77, 13.44, 8.08
Tasks: 928 total, 1 running, 434 sleeping, 0 stopped, 166 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 98.4%id, 1.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 193694612k total, 1811200k used, 191883412k free, 94680k buffers
Swap: 0k total, 0k used, 0k free, 707960k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15348 ec2-user 20 0 16028 2820 1820 R 0.3 0.0 0:00.10 top
15621 ec2-user 20 0 169m 7584 6684 S 0.3 0.0 0:00.01 ssh
15625 ec2-user 20 0 171m 7472 6552 S 0.3 0.0 0:00.01 ssh
15626 ec2-user 20 0 126m 3924 3492 S 0.3 0.0 0:00.01 perl
.....
All of the nodes top shows the similar state and produces the results by running only ~26 at a time!
I've aws-parallelcluster contains 5 nodes(each have 72 cpus) with torque scheduler and GNU Parallel 2018, Mar 2018
Update
By introducing the new function that takes input on stdin and running the script in parallel works great and utilizes all the CPU in local machine.
However, when its runs over remote machines it produces a
parallel: Error: test.lst is neither a file nor a block device
MCVE:
A simple code that echoing list gives the same error while running it in remote machines but works great in local machine:
cat test.lst # contains list
DNMT3L-5yx2B_1_N-DNMT3L-5yx2B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6brrC_3_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57C_2_N
DNMT3L-5yx2B_1_N-DUX4-6e8cA_4_N
DNMT3L-5yx2B_1_N-E2F8-4yo2A_3_P
DNMT3L-5yx2B_1_N-E2F8-4yo2A_6_N
DNMT3L-5yx2B_1_N-EBF3-3n50A_2_N
DNMT3L-5yx2B_1_N-ELK4-1k6oA_3_N
DNMT3L-5yx2B_1_N-EPAS1-1p97A_1_N
cat test_job.sh # GNU parallel submission script
#!/bin/bash
#PBS -l nodes=1:ppn=72
#PBS -N test
#PBS -k oe
# introduce new function and Run from ~/
dowork() {
parallel sh test_work.sh {}
}
export -f dowork
parallel -a test.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
cat test_work.sh # run/work script
#!/bin/bash
i=$1
data=pwd
#create temporary folder in current dir
TMP_DIR=$data/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# split list
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
# echo list and save in echo_test.out
echo $mk, $nk >> $data/echo_test.out
cd $TMP_DIR/../
rm -rf $TMP_DIR
From your timing:
real 0m0.179s
user 0m0.005s
sys 0m0.000s sec
it seems the tool uses very little CPU power. When GNU Parallel runs local jobs it has an overhead of 10 ms CPU time per job. Your jobs use 179 ms time, and 5 ms CPU time. So GNU Parallel will be using quite a bit of the time spent.
The overhead is much worse when running jobs remotely. Here we are talking 10 ms + running an ssh command. This can easily be in the order of 100 ms.
So how can we minimize the number of ssh commands and how can spread the overhead over multiple cores?
First let us make a function that can take input on stdin and run the script - one job per CPU thread in parallel:
dowork() {
[...set variables here. that becomes particularly important we when run remotely...]
parallel sh run_script.sh {}
}
export -f dowork
Test that this actually works by running:
head -n 1000 example.lst | dowork
Then let us look at running jobs locally. This can be done similar to described here: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
parallel -a example.lst --pipepart --block -10 dowork
This will split example.lst into 10 blocks per CPU thread. So on a machine with 72 CPU threads this will make 720 blocks. It will the start 72 doworks and when one is done it will get another of the 720 blocks. The reason I choose 10 instead of 1 is if one of the jobs "get stuck" for a while, then you are unlikely to notice this.
This should make sure 100% of the CPUs on the local machine is busy.
If that works, we need to distribute this work to remote machines:
parallel -j1 -a example.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
This should in total start 10 ssh per CPU thread (i.e. 5*72*10) - namely one for each block. With 1 running per server listed in $PBS_NODEFILE in parallel.
Unfortunately this means that --joblog and --resume will not work. There is currently no way to make that work, but if it is valuable to you contact me via parallel#gnu.org.
I am not sure what tool does. But if the copying takes most of the time and if tool only reads the files, then you might just be able symlink the files into $TMP_DIR instead of copying.
A good indication of whether you can do it faster is to look at top of the 5 machines in the cluster. If they are all using all cores at >90% then you cannot expect to get it faster.

Run multiple curl commands in parallel

I have the following shell script. The issue is that I want to run the transactions parallel/concurrently without waiting for one request to finish to go to the next request. For example if I make 20 requests, I want them to be executed at the same time.
for ((request=1;request<=20;request++))
do
for ((x=1;x<=20;x++))
do
time curl -X POST --header "http://localhost:5000/example"
done
done
Any guide?
You can use xargs with -P option to run any command in parallel:
seq 1 200 | xargs -n1 -P10 curl "http://localhost:5000/example"
This will run curl command 200 times with max 10 jobs in parallel.
Using xargs -P option, you can run any command in parallel:
xargs -I % -P 8 curl -X POST --header "http://localhost:5000/example" \
< <(printf '%s\n' {1..400})
This will run give curl command 400 times with max 8 jobs in parallel.
Update 2020:
Curl can now fetch several websites in parallel:
curl --parallel --parallel-immediate --parallel-max 3 --config websites.txt
websites.txt file:
url = "website1.com"
url = "website2.com"
url = "website3.com"
This is an addition to #saeed's answer.
I faced an issue where it made unnecessary requests to the following hosts
0.0.0.1, 0.0.0.2 .... 0.0.0.N
The reason was the command xargs was passing arguments to the curl command. In order to prevent the passing of arguments, we can specify which character to replace the argument by using the -I flag.
So we will use it as,
... xargs -I '$' command ...
Now, xargs will replace the argument wherever the $ literal is found. And if it is not found the argument is not passed. So using this the final command will be.
seq 1 200 | xargs -I $ -n1 -P10 curl "http://localhost:5000/example"
Note: If you are using $ in your command try to replace it with some other character that is not being used.
Adding to #saeed's answer, I created a generic function that utilises function arguments to fire commands for a total of N times in M jobs at a parallel
function conc(){
cmd=("${#:3}")
seq 1 "$1" | xargs -n1 -P"$2" "${cmd[#]}"
}
$ conc N M cmd
$ conc 10 2 curl --location --request GET 'http://google.com/'
This will fire 10 curl commands at a max parallelism of two each.
Adding this function to the bash_profile.rc makes it easier. Gist
Add “wait” at the end, and background them.
for ((request=1;request<=20;request++))
do
for ((x=1;x<=20;x++))
do
time curl -X POST --header "http://localhost:5000/example" &
done
done
wait
They will all output to the same stdout, but you can redirect the result of the time (and stdout and stderr) to a named file:
time curl -X POST --header "http://localhost:5000/example" > output.${x}.${request}.out 2>1 &
Wanted to share my example how I utilised parallel xargs with curl.
The pros from using xargs that u can specify how many threads will be used to parallelise curl rather than using curl with "&" that will schedule all let's say 10000 curls simultaneously.
Hope it will be helpful to smdy:
#!/bin/sh
url=/any-url
currentDate=$(date +%Y-%m-%d)
payload='{"field1":"value1", "field2":{},"timestamp":"'$currentDate'"}'
threadCount=10
cat $1 | \
xargs -P $threadCount -I {} curl -sw 'url= %{url_effective}, http_status_code = %{http_code},time_total = %{time_total} seconds \n' -H "Content-Type: application/json" -H "Accept: application/json" -X POST $url --max-time 60 -d $payload
.csv file has 1 value per row that will be inserted in json payload
Based on the solution provided by #isopropylcyanide and the comment by #Dario Seidl, I find this to be the best response as it handles both curl and httpie.
# conc N M cmd - fire (N) commands at a max parallelism of (M) each
function conc(){
cmd=("${#:3}")
seq 1 "$1" | xargs -I'$XARGI' -P"$2" "${cmd[#]}"
}
For example:
conc 10 3 curl -L -X POST https://httpbin.org/post -H 'Authorization: Basic dXNlcjpwYXNz' -H 'Content-Type: application/json' -d '{"url":"http://google.com/","foo":"bar"}'
conc 10 3 http --ignore-stdin -F -a user:pass httpbin.org/post url=http://google.com/ foo=bar

Create a new batch .txt file with specified content for every current file in a directory

I have a huge list of files on a cluster and I need to create a .txt file for each "pair". Each pair is specified by filename_R1.fq.gz and filename_R2.fq.gz. for each pair of R1 and R2 files I need to create a text file that contains:
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \
--phred33 \
--fast-local \
-X 1000 \
-p 12 \
-x /usr3/graduate/dhc285/reference_files/21G6 \
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \
| samtools view -bS - > ${i%_R1.fq.gz}.bam
Where the $i command refers to my filenames. I would also like each file to be named ${i%_R1.fq.gz}.txt. Thanks!
Using GNU Parallel it looks like this:
sge_jobfile() {
i="$1"
cat <<EOF > ${i%_R1.fq.gz}.txt
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \\
| samtools view -bS - > ${i%_R1.fq.gz}.bam
EOF
}
export -f sge_jobfile
parallel sge_jobfile ::: *_R1.fq.gz
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Resources