parallel wget download url list and rename - bash

Based on the accepted answer to the question "wget download files in parallel and rename":
cat list.txt | xargs -n 2 -P 4 wget -O
What is the GNU Parallel version of the command?
I have tried cat list.txt | parallel -N2 -j 20 --gnu "wget {2} -O {1} --no-check-certificate" but without success.

If
cat list.txt | xargs -n 2 -P 4 wget -O
works, then the GNU Parallel version is:
cat list.txt | parallel -n 2 -P 4 wget -O
GNU Parallel is designed to be a drop-in replacement for most xargs situations.
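If you prefer the positional form from your attempt, something like this should also work; a minimal sketch, assuming list.txt alternates output filename and URL, one per line, as the xargs -O usage implies:
cat list.txt | parallel -N2 -j 20 wget {2} -O {1} --no-check-certificate
With -N2 each job receives two consecutive lines, available as {1} (the filename) and {2} (the URL).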

Related

Listing existing files that are not present in a list using shell

How do I list files that exist, but are not present in the list? More specifically, I'd like to remove *.cpp files not listed in Build. Something like this lists files that are present in both the current directory and the Build file:
ls *.cpp | xargs -I % bash -c 'grep % Build'
However, the following line is incorrect of course:
ls *.cpp | xargs -I % bash -c 'grep -v % Build'
Thus the question: how does one list the *.cpp files that are not present in the Build file using shell commands? I can do something like this, but this is ugly:
ls *.cpp | perl -e 'while(<>){chomp;my $l=`grep $_ Build`;chomp $l;if(length $l==0){print("rm $_\n");}}'
You want comm or join to join two sorted lists together. I always mix up comm's arguments, but I think:
comm -23 <(find . -type f -name '*.cpp' | sort) <(sort Build) |
xargs -d '\n' echo rm
or if you want to depend on filename expansion:
shopt -s nullglob # at least
comm -23 <(printf "%s\n" *.cpp | sort) <(sort Build) | ...
Do not parse the output of ls.
The <(...) is bash-specific process substitution. In a non-bash shell, just create temporary files with the output of the processes.
GNU grep already offers you this possibility with the -f switch:
printf '%s\n' *.cpp | grep -F -x -v -f Build
-F: no regex
-x: full-line match
-v: invert (not match)
-f: match any of the lines in Build
In other words: filter out any line that appears in Build.
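To go from listing the stray files to actually removing them, the same filter can feed xargs; a minimal sketch, assuming GNU xargs (for -d) and filenames without embedded newlines:
printf '%s\n' *.cpp | grep -F -x -v -f Build | xargs -d '\n' rm --
The trailing -- guards against filenames that start with a dash.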

Bash: Download files from file list

I have a file named files.txt with all the files I want to download.
files.txt
http://file/to/download/IC_0000.tpl
http://file/to/download/IC_0001.tpl
If I use
cat files.txt | egrep -v "(^#.*|^$)" | xargs -n 1 wget
all files are downloaded.
But I don't know what to use if files.txt contains only the filenames, without the http part:
files.txt
IC_0000.tpl
IC_0001.tpl
I have "wget" only with this paramter:
Usage: wget [-c|--continue] [-s|--spider] [-q|--quiet] [-O|--output-document FILE]
[--header 'header: value'] [-Y|--proxy on/off] [-P DIR]
[--no-check-certificate] [-U|--user-agent AGENT] [-T SEC] URL...
Can you help me, please?
Many thanks.
Simply try wget -i files.txt (see http://www.gnu.org/software/wget/manual/wget.html#Logging-and-Input-File-Options)
If you don't have the host in the file, try:
for i in `cat files.txt`; do wget "${HOST}/${i}"; done
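That loop splits on any whitespace in the filenames; a slightly more robust sketch, with BASE_URL as a placeholder for your common URL prefix (full-featured GNU wget also has -B/--base to prefix relative entries read with -i, but the stripped-down wget shown above lacks -i):
BASE_URL='http://file/to/download'   # placeholder: substitute your host/path
while IFS= read -r f; do
  wget "${BASE_URL}/${f}"
done < files.txt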
I'll just leave this here...
For macOS:
file_name=newMP3List.txt && cur_path=$(pwd) && split -l 50 $file_name PART && find . -name "PART*" -print0 | xargs -0 -I f osascript -e "tell application \"Terminal\" to do script \"cd $cur_path && cat f | while read CMD; do curl -O \\\"\$CMD\\\"; done; rm f;\""
For Linux:
cat newMP3List.txt | while read CMD; do curl -O "$CMD"; done
Replace newMP3List.txt with your filename.

Redirect output of xargs to file

I want to delete the first line of every file in a directory and save the corresponding output by appending '.tmp' to each filename. For example, if there is a file named input.txt with the following content:
line 1
line 2
I want to create a file in the same directory with name input.txt.tmp which will have the following content
line 2
I'm trying this command:
find . -type f | xargs -I '{}' tail -n +2 '{}' > '{}'.tmp
The problem is, instead of writing output to separate files with a .tmp suffix, it creates just one file named {}.tmp. I understand this happens because the shell performs the redirection once for the whole pipeline, before xargs ever runs, so '{}' is never substituted there. But is there any way to tell xargs that the output redirection is part of its argument?
Note you can use find together with -exec, without need to pipe to xargs:
find . -type f -exec sh -c 'f="$1"; tail -n +2 "$f" > "$f.tmp"' sh {} \;
Here the filename is passed to the inner shell as $1 and stored in f, which then performs the tail and redirection. Passing the name as a positional parameter, rather than substituting {} straight into the script, keeps filenames containing spaces or quotes from breaking the inner shell.
If you have GNU Parallel you can run:
find . -type f | parallel tail -n +2 {} '>' {}.tmp
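For completeness, the same delegate-the-redirection-to-a-shell trick answers the literal question for xargs as well; a sketch along the lines of the find -exec answer above:
find . -type f | xargs -I '{}' sh -c 'tail -n +2 "$1" > "$1.tmp"' sh '{}'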
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Parallel tar with split for large folders

I have a really huge folder that I would like to gzip and split for archiving:
#!/bin/bash
dir=$1
name=$2
size=32000m
tar -czf /dev/stdout ${dir} | split -a 5 -d -b $size - ${name}
Is there a way to speed this up with GNU Parallel?
Thanks.
It seems the best tool for parallel gzip compression is pigz. See the comparisons.
With it you can have a command like this:
tar -c "${dir}" | pigz -c | split -a 5 -d -b "${size}" - "${name}"
With its option -p you could also specify the number of threads to use (default is the number of online processors, or 8 if unknown). See pigz --help or man pigz for more info.
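For example, to cap compression at 8 threads, you could write (a sketch of the same pipeline; assumes pigz is installed):
tar -c "${dir}" | pigz -p 8 -c | split -a 5 -d -b "${size}" - "${name}"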
UPDATE
Using GNU Parallel you could do something like this:
contents=("$dir"/*)
outdir=/somewhere
parallel tar -cvpzf "${outdir}/{}.tar.gz" "$dir/{}" ::: "${contents[@]##*/}"
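Note the trade-off: this produces one ${outdir}/<entry>.tar.gz per top-level entry of $dir, compressed in parallel, rather than a single archive cut into fixed-size pieces as in the original script.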

xargs into different files

I have a bash 'for loop' that does what I want
for i in *.data
do
./prog $i >dir/$i.bck
done
Can I turn this into an xargs construct?
I've tried something like
ls *.data|xargs -n1 -I FILE ./prog FILE >dir/FILE.bck
But I have problems with the FILE on the right side of '>'.
thanks
Give this a try (you can use FILE instead of % if you prefer):
find -maxdepth 1 -name '*.data' -print0 | xargs -0 -n1 -I % sh -c './prog % > dir/%.bck'
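A variant that passes the filename as a positional parameter, instead of splicing % into the script, keeps names with spaces or quotes from breaking the inner shell (a sketch, same assumptions as above):
find . -maxdepth 1 -name '*.data' -print0 | xargs -0 -n1 sh -c './prog "$1" > "dir/$1.bck"' sh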
GNU Parallel http://www.gnu.org/software/parallel/ is designed for this kind of task:
ls *.data | parallel ./prog {} '>'dir/{}.bck
IMHO this is more readable than the xargs solution provided.
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
