GNU parallel not spawning jobs - bash

After an upgrade to Debian 8.6 Jessie, GNU parallel suddenly stopped parallelizing to more than 2 jobs when used with the --pipe and -L options.
Before the upgrade the command:
cat file_with_1064_lines.txt | parallel -L10 -j5 -k -v --pipe "wc -l"
spawned 5 processes, which output this:
wc -l
10
wc -l
10
...
The same command after the upgrade:
wc -l
1060
wc -l
4
(The two values above change with the -L option value: the first is L*floor(1064/L) and the second is 1064 mod L, but there are always only two processes producing output.)
The same behavior is observed regardless of the parallel version (tested the latest and one from 2013).
PS.
$ uname -a
Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
$ parallel --version
GNU parallel 20161222

-L is the record size, not the number of lines passed to each job. The bug was fixed around version 20130122. What you want is to read 1 record of 10 lines:
parallel -L10 -N1 -j5 -k -v --pipe wc -l
or 10 records of 1 line:
parallel -L1 -N10 -j5 -k -v --pipe wc -l
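To verify (a minimal check using seq as a stand-in for the input file, and dropping -v for brevity), each job should now receive exactly 10 lines, with a final partial record of 4:
$ seq 1064 | parallel -L10 -N1 -j5 -k --pipe wc -l | uniq -c
    106 10
      1 4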


xargs doesn't work on SUSE

This problem occurs only on SUSE; it works on Ubuntu and even on Windows through Babun.
My goal is to replace a word in several files with another word.
This is what I'm trying:
$ grep -Inrs MY_PATTERN src/ | cut -d: -f1 | xargs sed -i 's/MY_PATTERN/NEW_WORD/g'
sed: can't read path/to/a/found/file_1: No such file or directory
sed: can't read path/to/a/found/file_2: No such file or directory
sed: can't read path/to/a/found/file_3: No such file or directory
...
Knowing that
$ grep -Inrs MY_PATTERN src/ | cut -d: -f1
path/to/a/found/file_1
path/to/a/found/file_2
path/to/a/found/file_3
UPDATE1
This doesn't work either
$ grep -lZ -Irs MY_PATTERN src/ | xargs -0 ls
ls: cannot access path/to/a/found/file_1: No such file or directory
ls: cannot access path/to/a/found/file_2: No such file or directory
ls: cannot access path/to/a/found/file_3: No such file or directory
...
$ ls -al path/to/a/found/file_1 | cat -vet
-rw-r--r-- 1 webme 886 Feb 1 13:36 path/to/a/found/file_1$
UPDATE2
$ whoami
webme
$ uname -a
Linux server_vm_id_34 3.0.101-68-default #1 SMP Tue Dec 1 16:21:37 UTC 2015 (ed01a9f) x86_64 x86_64 x86_64 GNU/Linux
UPDATE3
$ grep --version
grep (GNU grep) 2.7
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ xargs --version
xargs (GNU findutils) 4.4.0
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Built using GNU gnulib version e5573b1bad88bfabcda181b9e0125fb0c52b7d3b
# My project path /home/users/webme/projects/my_project
$ df -T
dl360d-01:/homeweb/users/webme nfs 492625920 461336576 31289344 94% /home/users/webme
$ id
uid=1689(webme) gid=325(web) groups=325(web)
$ mount -v
/dev/vda2 on / type btrfs (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/vda1 on /boot type ext3 (rw,acl,user_xattr)
/dev/vdb on /appwebinet type xfs (rw)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
dl360d-01:/homeweb/users on /homeweb/users type nfs (rw,soft,bg,addr=xx.xx.xx.xx)
dl360d-01:/appwebinet/tools/list on /appwebinetdev/tools/list type nfs (ro,soft,sloppy,addr=xx.xxx.xx.xx)
dl360d-01:/homeweb/users/webme on /home/users/webme type nfs (rw,soft,bg,addr=xx.xxx.xx.xx)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /var/lib/ntp/proc type proc (ro,nosuid,nodev)
I don't have exportfs, nor can I install it with sudo zypper install exportfs.
I'd suggest keeping it simple:
$ grep -lZ -Irs foo * | xargs -0 sed -i 's/foo/bar/g'
grep -l outputs matching file names only, which is really what you want in this pipeline. grep -Z terminates each matching file name with a NUL, which xargs -0 can pick up. This allows for file names with embedded white-space to pass between the grep and the xargs unfettered.
# show the structure
$ tree
.
└── path
    └── to
        └── a
            └── found
                ├── file_1
                ├── file_2
                └── file_3
# show the contents
$ grep . path/to/a/found/file_*
path/to/a/found/file_1:a foo bar
path/to/a/found/file_2:a foo bar
path/to/a/found/file_3:a foo bar
# try it out
$ grep -lZ -Irs foo * | xargs -0 ls -l
-rw-rw-r-- 1 bishop bishop 10 Feb 13 09:13 path/to/a/found/file_1
-rw-rw-r-- 1 bishop bishop 10 Feb 13 09:13 path/to/a/found/file_2
-rw-rw-r-- 1 bishop bishop 10 Feb 13 09:13 path/to/a/found/file_3
bishop's helpful answer is the simplest and most robust solution in this case.
This answer may still be of interest for (a) a discussion of the -d vs. the -n options, and (b) how to preview the command(s) that xargs would execute.
From what I understand, SUSE uses GNU utilities, so you can use xargs -d'\n':
grep -Inrs MY_PATTERN src/ | cut -d: -f1 | xargs -d'\n' sed -i 's/MY_PATTERN/NEW_WORD/g'
xargs -d'\n' ensures that each input line as a whole is treated as its own argument (preserving the integrity of filenames with spaces), while still passing as many arguments as possible (typically, all) at once.
(By contrast, -n 1 would split the input on whitespace and call the target command with one argument at a time.)
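A quick illustration of the difference, using two-word lines as stand-ins for filenames with spaces:
$ printf 'a b\nc d\n' | xargs -n1 echo
a
b
c
d
$ printf 'a b\nc d\n' | xargs -d'\n' echo
a b c d
With -n1, each whitespace-separated token becomes a separate invocation; with -d'\n', each line is one argument and all arguments are passed in a single invocation.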
If you want to preview the command that would be executed, use an auxiliary bash command:
grep -Inrs MY_PATTERN src/ | cut -d: -f1 |
xargs -d'\n' bash -c 'printf "%q " "$@"' _ sed -i 's/MY_PATTERN/NEW_WORD/g'
Read on for an explanation.
Optional background information.
xargs has its own option, -p, for previewing the command(s) to execute and prompting for confirmation, but the preview doesn't reflect argument boundaries in the way you'd have to indicate them when calling the command directly from the shell.
A quick example:
$ echo 'hi, there' | xargs -p -d '\n' printf '%s'
printf %s hi, there ?...
What xargs will actually execute is the equivalent of printf '%s' 'hi, there', but that is not reflected in -p's prompt.
Workaround:
$ echo 'hi, there' | xargs -d '\n' bash -c 'printf "%q " "$@"' _ printf '%s'
printf %s hi\,\ there
The generic auxiliary bash command, bash -c 'printf "%q " "$@"' _, inserted just before the target command, quotes the arguments that xargs passes (on demand, only where necessary) in the way the shell would require to recognize each as a single argument, and joins them with spaces.
The net result is that a shell command is printed that is equivalent to what xargs would execute (though, as you can see, there is no guarantee that the input's original quoting style is retained).

Efficient method for parallel processing in Bash/Shell?

I have a text file (Input.txt) containing domains, about 35 million in total.
#Input.txt
google.com
cnn.com
bbc.com
........
Now I have a Python script to check the status code of each domain in the text file (Input.txt). For a smaller set, I do:
for i in $(cat Input.txt);do python status_check.py $i;done > out_file.txt
If I process them in this manner, it might take ages to check the status codes of all 35 million domains.
I'm not familiar with parallel processing. Can someone help me save time on this task using shell/bash/anything?
You are looking for GNU Parallel:
cat Input.txt | parallel -j 100 python status_check.py > out_file.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
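For example, the same jobs can be spread over remote machines with -S (a hypothetical sketch assuming host1 and host2 are reachable over ssh and have python and status_check.py available):
parallel -j 100 -S host1,host2 python status_check.py {} :::: Input.txt > out_file.txt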
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Put an ampersand after the command inside your loop (i.e. after $i) and each invocation will run "concurrently".
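A minimal sketch of that idea (with an assumed batch size of 100 to avoid forking 35 million processes at once; note that, unlike GNU Parallel, output from concurrent jobs may interleave):
#!/bin/bash
n=0
while read -r domain; do
    python status_check.py "$domain" &   # run each check in the background
    if (( ++n % 100 == 0 )); then
        wait    # let the current batch finish before starting more
    fi
done < Input.txt > out_file.txt
wait    # wait for the final batch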
Bash is probably not the right tool to do this. Each fork is very expensive resource-wise. You'd be better off using Ruby or Python, reading this into an array and then processing it inside the interpreter's VM.
Why not alter your Python script to read the URLs itself and then distribute the processing?
It seems a bit pointless to have a bash for-loop when you could just do that in Python.
There are a number of Python modules for handling parallel processing listed here.

parallel check md5 file

I have an md5sum file containing lots of lines. I want to use GNU parallel to accelerate the md5sum checking process. When given no file argument, md5sum -c reads the checksum list from stdin. I tried this:
cat checksums.md5 | parallel md5sum -c {}
But getting this error:
md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory
How can I parallelize the md5sum checking?
Assuming checksums.md5 has the format:
d41d8cd98f00b204e9800998ecf8427e My file name
Run:
cat checksums.md5 | parallel --pipe -N1 md5sum -c
If your files are small, batch more lines per job: -N100.
If that does not speed up your processing, make sure your disks are fast enough: md5sum can process 500 MB/s. iostat -dkx 1 can tell you whether your disks are the bottleneck.
You need the --pipe option. In this mode parallel splits stdin into blocks and supplies each block to the command on its stdin; see man parallel for details:
cat checksums.md5 | parallel --pipe md5sum -c -
By default the block size is 1 MB; this can be changed with the --block option.
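For example, to use 10 MB blocks (a small variation on the command above):
cat checksums.md5 | parallel --pipe --block 10M md5sum -c -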

how to ping each ip in a file?

I have a file named "ips" containing all the IPs I need to ping. In order to ping those IPs, I use the following code:
cat ips|xargs ping -c 2
but the console shows me the usage of ping; I don't know how to do it correctly. I'm using Mac OS.
You need to use the -n1 option with xargs to pass one IP at a time, as ping doesn't support multiple IPs:
$ cat ips | xargs -n1 ping -c 2
Demo:
$ cat ips
127.0.0.1
google.com
bbc.co.uk
$ cat ips | xargs echo ping -c 2
ping -c 2 127.0.0.1 google.com bbc.co.uk
$ cat ips | xargs -n1 echo ping -c 2
ping -c 2 127.0.0.1
ping -c 2 google.com
ping -c 2 bbc.co.uk
# Drop the UUOC and redirect the input
$ xargs -n1 echo ping -c 2 < ips
ping -c 2 127.0.0.1
ping -c 2 google.com
ping -c 2 bbc.co.uk
With an IP or hostname on each line of the ips file:
( while read ip; do ping -c 2 $ip; done ) < ips
You can also change the timeout with the -W flag, so that if some host isn't up it won't block your script for too long. The -q flag for quiet output is also useful in this case.
( while read ip; do ping -c1 -W1 -q $ip; done ) < ips
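Note that this is the Linux syntax; on the BSD ping that ships with macOS (which the question is about), -W takes the wait time in milliseconds rather than seconds, so the equivalent there would be something like:
( while read ip; do ping -c1 -W1000 -q $ip; done ) < ips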
If the file has one IP per line (and it's not overly large), you can do it with a for loop:
for ip in $(cat ips); do
ping -c 2 $ip;
done
You could use fping. It also pings the hosts in parallel and has script-friendly output.
$ cat ips | xargs fping -q -C 3
10.xx.xx.xx : 201.39 203.62 200.77
10.xx.xx.xx : 288.10 287.25 288.02
10.xx.xx.xx : 187.62 187.86 188.69
...
With GNU Parallel you would do:
parallel -j0 ping -c 2 {} :::: ips
This will run as many jobs in parallel as there are IPs (or as many as your system can manage).
It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed that you will not get half a line from two different jobs.
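If you want each line of output labelled with the host it came from, add --tag, which prefixes every output line with the argument:
parallel -j0 --tag ping -c 2 {} :::: ips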
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Try doing this:
cat ips | xargs -i% ping -c 2 %
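Note that -i is a deprecated GNU extension that the BSD xargs shipped with macOS does not support; the portable spelling uses -I:
cat ips | xargs -I% ping -c 2 %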
As suggested by @Lupus, you can use fping, but the output is not human-friendly: it will scroll off your screen in a few seconds, leaving you with no trace of what is going on. To address this I've just released ping-xray. I tried to make it as visual as possible in an ASCII terminal, and it also creates CSV logs with exact millisecond resolution for all targets.
https://dimon.ca/ping-xray/
Hope you'll find it helpful.

Strange behavior of uniq on darwin shells

I've used 'uniq -d -c file' in many shell scripts on Linux machines, and it works.
On my Mac (OS X 10.6.7 with developer tools installed) it doesn't seem to work:
$ uniq -d -c testfile.txt
usage: uniq [-c | -d | -u] [-i] [-f fields] [-s chars] [input [output]]
It would be nice if anyone could check this.
Well, it's right there in the usage message. [-c | -d | -u] means you can use one of those possibilities, not two.
Since OSX is based on BSD, you can check that here or, thanks to Ignacio, the more Apple-specific one here.
If you want to achieve a similar output, you could use:
do_your_thing | uniq -c | grep -v '^ *1 '
which will strip out all those coalesced lines that have a count of one.
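For example (the spacing of the counts may differ between BSD and GNU uniq):
$ printf 'a\na\nb\nc\nc\nc\n' | uniq -c | grep -v '^ *1 '
   2 a
   3 c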
You can try this awk solution:
awk '{a[$0]++} END {for (i in a) if (a[i] > 1) print i, a[i]}' file
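For example (note that, unlike uniq, this does not require sorted input, and the output order of for (i in a) is unspecified):
$ printf 'a\nb\na\na\nc\nb\n' | awk '{a[$0]++} END {for (i in a) if (a[i] > 1) print i, a[i]}'
a 3
b 2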
