I have a really huge folder that I would like to gzip and split for archiving:
#!/bin/bash
dir=$1
name=$2
size=32000m
tar -czf /dev/stdout ${dir} | split -a 5 -d -b $size - ${name}
Is there a way to speed this up with GNU Parallel?
Thanks.
It seems the best tool for parallel gzip compression is pigz. See the comparisons.
With it you can have a command like this:
tar -c "${dir}" | pigz -c | split -a 5 -d -b "${size}" - "${name}"
With its option -p you could also specify the number of threads to use (default is the number of online processors, or 8 if unknown). See pigz --help or man pigz for more info.
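For example, to cap pigz at eight threads (a sketch using the same variables as above):
tar -c "${dir}" | pigz -c -p 8 | split -a 5 -d -b "${size}" - "${name}"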
UPDATE
Using GNU parallel you could do something like this:
contents=("$dir"/*)
outdir=/somewhere
parallel tar -cvpzf "${outdir}/{}.tar.gz" "$dir/{}" ::: "${contents[@]##*/}"
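Note that, unlike the single split archive produced by the pigz pipeline, this creates one .tar.gz in $outdir per top-level entry of $dir.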
I want to look through 100K+ text files from a directory and copy to another directory only the ones which contain at least one word from a list.
I tried doing an if statement with grep and cp, but I have no idea how to make it work this way.
for filename in *.txt
do
grep -o -i "cultiv" "protec" "agricult" $filename|wc -w
if [ wc -gt 0 ]
then cp $filename ~/desktop/filepath
fi
done
Obviously this does not work but I have no idea how to store the wc result and then compare it to 0 and only act on those files.
Use the -l option to have grep print all the filenames that match the pattern. Then use xargs to pass these as arguments to cp.
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs cp -t ~/desktop/filepath --
The -t option is a GNU cp extension; it lets you put the destination directory first so that the command works with xargs.
If you're using a cp version without that option, you need to use the -J option of (BSD) xargs to substitute the filenames in the middle of the command:
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs -J {} cp -- {} ~/desktop/filepath
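If you would rather keep the loop structure from the question, a minimal corrected sketch (using grep -q, which only sets the exit status, and the same destination path as above) could look like this:
for filename in *.txt; do
    if grep -q -i -E 'cultiv|protec|agricult' "$filename"; then
        cp -- "$filename" ~/desktop/filepath/
    fi
done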
Based on the accepted answer to the question wget download files in parallel and rename:
cat list.txt | xargs -n 2 -P 4 wget -O
What is the GNU Parallel version of the command?
I have tried cat list.txt | parallel -N2 -j 20 --gnu "wget {2} -O {1} --no-check-certificate" but without success.
If:
cat list.txt | xargs -n 2 -P 4 wget -O
works, then the GNU Parallel version is:
cat list.txt | parallel -n 2 -P 4 wget -O
GNU Parallel is designed to be a drop-in replacement for xargs in most situations.
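If you want explicit placeholders instead (assuming, as the xargs version implies, that list.txt alternates output filename and URL), an equivalent sketch would be:
cat list.txt | parallel -N2 -P 4 wget -O {1} {2}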
I can run:
echo "asdf" > testfile
tar czf a.tar.gz testfile
tar czf b.tar.gz testfile
md5sum *.tar.gz
and it turns out that a.tar.gz and b.tar.gz have different md5 hashes. They really are different, as diff -u a.tar.gz b.tar.gz confirms.
What additional flags do I need to pass in to tar so that its output is consistent over time with the same input?
tar czf outfile infiles is equivalent to
tar cf - infiles | gzip > outfile
The reason the files are different is because gzip puts its input filename and modification time into the compressed file. When the input is a pipe, it uses an empty string as the filename and the current time as the modification time.
But gzip also has a --no-name option, which tells it not to store the name and timestamp in the file. So if you write the expanded command explicitly, instead of using tar's -z option, you can make use of this option:
tar cf - testfile | gzip --no-name > a.tar.gz
tar cf - testfile | gzip --no-name > b.tar.gz
I tested this on OS X 10.6.8 and it works.
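A quick way to confirm, reusing the md5sum check from the question:
tar cf - testfile | gzip --no-name > a.tar.gz
tar cf - testfile | gzip --no-name > b.tar.gz
md5sum a.tar.gz b.tar.gz    # the two hashes should now match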
For macOS:
In man tar (bsdtar), under the --options section, there is a !timestamp option, which excludes the timestamp from the gzip archive. Usage:
tar --options '!timestamp' -cvzf archive.tgz filename
This produces the same md5 sum for the same files with the same names.
Within a makefile I run the following command
find SOURCE_DIR -name '*.gz' | xargs -P4 -L1 bash -c 'zcat $$1 | grep -F -f <(zcat patternfile.csv.gz) | gzip > TARGET_DIR/$${1##*/}' -
patternfile.csv.gz contains 2M entries with an unzipped file size of 100MB, each file in SOURCE_DIR has a zipped file size of ~20MB.
However, each xargs process consumes more than 6GB of RAM. Does this make sense, or am I missing something here?
Thanks for your help.
I would like to efficiently convert a couple of JPEG images contained in a tar.gz into an x264 mp4 movie.
gzip -cd Monitor-1-xx.tar.gz|cpio -i --to-stdout|jpegtopnm|ppmtoy4m -F 4:1| \
x264 --crf 24 -o Monitor-1-xx.mp4 --stdin y4m -
The problem here is that, after cpio I have multiple jpg files in a single stream and jpegtopnm only converts the first one.
I would like to find a way to split the stream (or to get it pre-split), and then run jpegtopnm once for each piece. It is somewhat like what xargs does when I untar to disk first. Writing to disk is something I am trying to eschew:
mkdir tmpMonitor && cd tmpMonitor && tar -xf ../Monitor-1-xx.tar.gz
find . -iname "*.jpg"|xargs -n1 jpegtopnm|ppmtoy4m -F 4:1| \
x264 --crf 24 -o ../xx.mp4 --stdin y4m -
cd .. && rm -rf tmpMonitor
Any suggestions?
tar has a couple of options that may be useful here (I have GNU tar, so apologies in advance for assuming you do in case you actually don't):
--wildcards - lets you pick files to extract from the tar using globs like *.jpeg
--to-command - pipe each extracted file to the given command.
So maybe something like this?
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command="jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx.mp4 --stdin y4m -"
Well, I don't know much about x264, so consider that untested code; I tested it using simple .txt files instead of .jpegs and cat -n instead of jpegtopnm, etc. The other thing is, I am guessing you want separate output files (one per jpeg), so ../xx.mp4 won't do. Assuming you want a separate invocation of jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx.mp4 --stdin y4m - for each file, you need a different output filename for -o, in which case the following hack might work (note the single quotes, so that date is expanded once per extracted file by the shell tar spawns, rather than once by the invoking shell):
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command='jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx-$(date +%H%M%S%N).mp4 --stdin y4m -'
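As an alternative to timestamp-based names, GNU tar exports the name of the member being extracted to the --to-command process in the TAR_FILENAME environment variable, so an untested variant could name each output after its source file instead:
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command='jpegtopnm | ppmtoy4m -F 4:1 | x264 --crf 24 -o "../${TAR_FILENAME##*/}.mp4" --stdin y4m -'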