Efficient method for parallel processing in Bash/Shell? - bash

I have a text file (Input.txt) containing domains, about 35 million domains in total.
#Input.txt
google.com
cnn.com
bbc.com
........
Now, I have a Python script to check the status code of each and every domain in the text file (Input.txt). For a smaller set, I do
for i in $(cat Input.txt);do python status_check.py $i;done > out_file.txt
If I process them in this manner, it might take ages to check the status code for all 35 million domains.
I'm not familiar with parallel processing. Can someone help me with how to achieve this task in less time using shell/bash/anything?

You are looking for GNU Parallel:
cat Input.txt | parallel -j 100 python status_check.py > out_file.txt
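For an input file this large you may also want to be able to resume after an interruption; a minimal sketch along the same lines (the job-log filename and the -j 100 job count are assumptions, not part of the original answer):
# Record finished jobs in a log so an interrupted run can continue with --resume
parallel --joblog status_check.log --resume -j 100 python status_check.py {} < Input.txt > out_file.txt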
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Put an ampersand after the command in your loop and each invocation will run "concurrently".
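A minimal sketch of that idea in plain Bash, capping how many background jobs run at once (the cap of 100 is an assumption; wait -n needs Bash 4.3+, and output from concurrent jobs may interleave):
while read -r domain; do
    python status_check.py "$domain" >> out_file.txt &    # run each check in the background
    while [ "$(jobs -rp | wc -l)" -ge 100 ]; do           # keep at most 100 jobs running
        wait -n                                           # wait for any one job to finish
    done
done < Input.txt
wait    # wait for the remaining jobs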
Bash is probably not the right tool for this. Each fork is very expensive resource-wise. You'd be better off using Ruby or Python, reading the file into an array and then processing it inside the interpreter's VM.

Why not alter your python script to read the URLs itself and then distribute the processing?
It seems a bit pointless having a bash for-loop when you could just do that in python.
There are a number of modules in python for handling parallel processing listed here.

Related

Issue with downloading multiple files with names in BASH

I'm trying to download multiple files in parallel using xargs. Things work well if I only download each file without giving it a name: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there any way to download with a filename, like wget -O [filename] [URL], but in parallel?
Below is my work. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
If a download fails, GNU Parallel can also retry commands with --retries 3.
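If you would rather stay with xargs, a hedged alternative (assuming Bash process substitution is available, and that no URL or name contains whitespace) is to pair each URL with its name and hand both to a small sh -c wrapper:
# Emit "URL NAME" pairs, then let xargs run up to 8 wgets, two arguments per command
paste -d ' ' <(printf '%s\n' "${links[@]}") <(printf '%s\n' "${names[@]}") |
    xargs -P 8 -n 2 sh -c 'wget -O "$2" "$1"' _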

How to Install GNU Parallel on Windows 10 using git-bash

Has anyone been able to successfully use GNU Parallel on Windows 10 with git-bash? Is it possible? - If so, how?
Background:
I'm having trouble installing GNU Parallel and using it, and it got me thinking - maybe git-bash is holding me back? I'm sure if I installed Ubuntu through WSL I wouldn't have any problems running GNU Parallel. But I wanted to know if I could do this in git-bash first.
I just installed git-bash on a Microsoft Windows 10 machine and had no problems installing GNU Parallel.
It is by no means well tested on git-bash, but basic functionality clearly works.
I'm having trouble installing GNU Parallel
Maybe you can post the error you get when running:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh

GNU parallel not spawning jobs

After an upgrade to Debian 8.6 Jessie, GNU parallel suddenly stopped parallelizing into more than 2 jobs when used with the --pipe and -L options.
Before the upgrade the command:
cat file_with_1064_lines.txt | parallel -L10 -j5 -k -v --pipe "wc -l"
spawned 5 processes, which output this:
wc -l
10
wc -l
10
...
The same command after the upgrade:
wc -l
1060
wc -l
4
(The two values above change with the -L option value -- the first is L*floor(1064/L) and the second is 1064 mod L, but there are always only two processes producing output.)
The same is observed independently of the parallel version (tested the latest and one from 2013).
PS.
$ uname -a
Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
$ parallel --version
GNU parallel 20161222
-L is the record size. The old behaviour was a bug that was fixed around 20130122. What you want is to read 1 record of 10 lines:
parallel -L10 -N1 -j5 -k -v --pipe wc -l
or 10 records of 1 line:
parallel -L1 -N10 -j5 -k -v --pipe wc -l
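A quick way to sanity-check the record handling (a sketch; seq stands in for the original 1064-line file):
# 10 lines per job: expect 106 jobs printing "10" and one final job printing "4"
seq 1064 | parallel -N10 -j5 -k --pipe wc -l | sort -n | uniq -c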

how to ping each ip in a file?

I have a file named "ips" containing all ips I need to ping. In order to ping those IPs, I use the following code:
cat ips|xargs ping -c 2
but the console just shows me the usage of ping; I don't know how to do it correctly. I'm using macOS.
You need to use the option -n1 with xargs to pass one IP at a time, as ping doesn't accept multiple hosts:
$ cat ips | xargs -n1 ping -c 2
Demo:
$ cat ips
127.0.0.1
google.com
bbc.co.uk
$ cat ips | xargs echo ping -c 2
ping -c 2 127.0.0.1 google.com bbc.co.uk
$ cat ips | xargs -n1 echo ping -c 2
ping -c 2 127.0.0.1
ping -c 2 google.com
ping -c 2 bbc.co.uk
# Drop the UUOC and redirect the input
$ xargs -n1 echo ping -c 2 < ips
ping -c 2 127.0.0.1
ping -c 2 google.com
ping -c 2 bbc.co.uk
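Since the point here is to go through many hosts, it may also be worth adding -P so xargs pings several of them at once (the limit of 4 is an assumption; output from concurrent pings may interleave):
# Ping up to 4 hosts concurrently, one IP per ping invocation
xargs -n1 -P 4 ping -c 2 < ips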
With an IP or hostname on each line of the ips file:
( while read ip; do ping -c 2 "$ip"; done ) < ips
You can also change the timeout with the -W flag, so that hosts which aren't up won't block your script for too long. The -q flag for quiet output is also useful in this case.
( while read ip; do ping -c1 -W1 -q "$ip"; done ) < ips
If the file is 1 ip per line (and it's not overly large), you can do it with a for loop:
for ip in $(cat ips); do
ping -c 2 $ip;
done
You could use fping. It also pings in parallel and has script-friendly output.
$ cat ips | xargs fping -q -C 3
10.xx.xx.xx : 201.39 203.62 200.77
10.xx.xx.xx : 288.10 287.25 288.02
10.xx.xx.xx : 187.62 187.86 188.69
...
With GNU Parallel you would do:
parallel -j0 ping -c 2 {} :::: ips
This will run as many jobs in parallel as you have IPs, limited only by how many processes your system allows.
It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed that you will not get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Try doing this:
cat ips | xargs -I % ping -c 2 %
As suggested by @Lupus you can use fping, but the output is not human-friendly - it will scroll out of your screen in a few seconds, leaving you with no trace of what is going on. To address this I've just released ping-xray. I tried to make it as visual as possible under an ascii terminal, plus it creates CSV logs with exact millisecond resolution for all targets.
https://dimon.ca/ping-xray/
Hope you'll find it helpful.

How to check broken links in a webpage?

I maintain a list of links to some resources on my blog.
If I find a link is broken, I add a class="broken" to it.
Sometimes the broken links come back alive again, so I remove the class="broken".
When the list gets very long, it's very hard to check the links one by one.
<ul>
<li>a</li>
<li>b</li>
<li>c</li>
<li>d</li>
</ul>
How to write a bash script to do the editing?
Maybe it's not the answer you're looking for, but why do it from bash, and not have the page use javascript that can do it on a per-request basis / on the fly? This should get you going: http://www.egrappler.com/jquery-broken-link-checker-plugin-jslink/
but I think it would also be possible to create similar logic on your own with jQuery's $.get / $.load methods.
Not quite an appropriate task for Bash.
Option 1: I'd use Java or Groovy, have a SAX handler simply dump all data to the output, except for the <a> elements, for which it would check the href value and, if broken, add the class="broken" part.
Option 2: Have an XSLT which would call a custom XSLT function on <a> elements. Again, I'd do this with Java, but any language with a good XSLT engine can do it.
Option 3: If you really, really want to feel geeky ;-) here's a one-liner that gives you a rather unreliable link checker in Bash:
grep -R '(?:href="(http://[^"]+)")' -ohPI | grep -oP 'http://[^"]+' | sort | uniq | wget -nv -S -O /dev/null -i - 2>&1 | grep -P '(wget:| -> |HTTP/|Location:)'
It could probably get better but I was okay with this.
Option 4: You could employ curl -L ... (the -L follows redirects) instead of wget.
grep -R '(?:"(http://[^"]+)")' -ohPI | grep -v search.maven.org | grep -oP 'http://[^"]+' | sort | uniq | xargs -I{} sh -c 'echo && echo "$1" && curl -i -I -L -m 5 -s -S "$1"' -- {} 2>&1 | grep -P '(^$|curl:|HTTP/|http://|https://|Location:)'
Pro tip: curl seems to have more script-friendly output, so you can parallelize it to speed things up: ... | xargs -n 1 -P 8 curl -L ... This will run 8 processes of curl and pass one argument (URL) at a time. Sorting out the output is up to you; I'd probably create one file per curl invocation and then concatenate them.
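A hedged sketch of that last idea (the urls.txt input, the curl_out/ directory and the md5sum-based filenames are assumptions; it checks links with a plain GET rather than HEAD):
# One curl per URL, 8 in parallel, one report file per invocation, then concatenate
mkdir -p curl_out
xargs -P 8 -n 1 sh -c 'curl -s -o /dev/null -L -m 5 -w "%{http_code} %{url_effective}\n" "$1" > "curl_out/$(printf "%s" "$1" | md5sum | cut -d" " -f1)"' _ < urls.txt
cat curl_out/* > link_report.txt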
