I have a really huge folder that I would like to gzip and split for archiving:
#!/bin/bash
dir=$1
name=$2
size=32000m
tar -czf /dev/stdout ${dir} | split -a 5 -d -b $size - ${name}
Is there a way to speed this up with GNU Parallel?
Thanks.
It seems the best tool for parallel gzip compression is pigz. See the comparisons.
With it you can have a command like this:
tar -c "${dir}" | pigz -c | split -a 5 -d -b "${size}" - "${name}"
With its option -p you could also specify the number of threads to use (default is the number of online processors, or 8 if unknown). See pigz --help or man pigz for more info.
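For example, to cap pigz at eight threads (a sketch using the same variables as above):
tar -c "${dir}" | pigz -c -p 8 | split -a 5 -d -b "${size}" - "${name}"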
UPDATE
Using GNU parallel you could do something like this:
contents=("$dir"/*)
outdir=/somewhere
parallel tar -cvpzf "${outdir}/{}.tar.gz" "$dir/{}" ::: "${contents[@]##*/}"
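Note that, unlike the single split archive produced by the pigz pipeline, this creates one .tar.gz in $outdir per top-level entry of $dir.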
I want to look through 100K+ text files from a directory and copy to another directory only the ones which contain at least one word from a list.
I tried doing an if statement with grep and cp, but I have no idea how to make it work this way.
for filename in *.txt
do
grep -o -i "cultiv" "protec" "agricult" $filename|wc -w
if [ wc -gt 0 ]
then cp $filename ~/desktop/filepath
fi
done
Obviously this does not work but I have no idea how to store the wc result and then compare it to 0 and only act on those files.
Use the -l option to have grep print all the filenames that match the pattern. Then use xargs to pass these as arguments to cp.
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs cp -t ~/desktop/filepath --
The -t option is a GNU cp extension; it lets you put the destination directory first so that the command works with xargs.
If you're using a cp version without that option, you need to use the -J option of (BSD) xargs to substitute the filenames in the middle of the command:
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs -J {} cp -- {} ~/desktop/filepath
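If you would rather keep the loop structure from the question, a minimal corrected sketch (using grep -q, which only sets the exit status, and the same destination path as above) could look like this:
for filename in *.txt; do
    if grep -q -i -E 'cultiv|protec|agricult' "$filename"; then
        cp -- "$filename" ~/desktop/filepath/
    fi
done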
Based on the accepted answer to the question wget download files in parallel and rename:
cat list.txt | xargs -n 2 -P 4 wget -O
What is the GNU Parallel version of the command?
I have tried cat list.txt | parallel -N2 -j 20 --gnu "wget {2} -O {1} --no-check-certificate" but without success.
If:
cat list.txt | xargs -n 2 -P 4 wget -O
works, then the GNU Parallel version is:
cat list.txt | parallel -n 2 -P 4 wget -O
GNU Parallel is designed to be a drop-in replacement for xargs in most situations.
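If you want explicit placeholders instead (assuming, as the xargs version implies, that list.txt alternates output filename and URL), an equivalent sketch would be:
cat list.txt | parallel -N2 -P 4 wget -O {1} {2}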
I can run:
echo "asdf" > testfile
tar czf a.tar.gz testfile
tar czf b.tar.gz testfile
md5sum *.tar.gz
and it turns out that a.tar.gz and b.tar.gz have different md5 hashes. They really are different, as diff -u a.tar.gz b.tar.gz confirms.
What additional flags do I need to pass in to tar so that its output is consistent over time with the same input?
tar czf outfile infiles is equivalent to
tar cf - infiles | gzip > outfile
The reason the files are different is because gzip puts its input filename and modification time into the compressed file. When the input is a pipe, it uses an empty string as the filename and the current time as the modification time.
But gzip also has a --no-name option, which tells it not to store the name and timestamp in the file. So if you write the expanded command explicitly, instead of using tar's -z option, you can make use of this option:
tar cf - testfile | gzip --no-name > a.tar.gz
tar cf - testfile | gzip --no-name > b.tar.gz
I tested this on OS X 10.6.8 and it works.
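A quick way to confirm, reusing the md5sum check from the question:
tar cf - testfile | gzip --no-name > a.tar.gz
tar cf - testfile | gzip --no-name > b.tar.gz
md5sum a.tar.gz b.tar.gz    # the two hashes should now match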
For macOS:
In man tar (bsdtar), under the --options section, there is a !timestamp option, which excludes the timestamp from the gzip archive. Usage:
tar --options '!timestamp' -cvzf archive.tgz filename
This produces the same md5 sum for the same files with the same names.
Within a makefile I run the following command
find SOURCE_DIR -name '*.gz' | xargs -P4 -L1 bash -c 'zcat $$1 | grep -F -f <(zcat patternfile.csv.gz) | gzip > TARGET_DIR/$${1##*/}' -
patternfile.csv.gz contains 2M entries with an unzipped file size of 100MB, each file in SOURCE_DIR has a zipped file size of ~20MB.
However, each xargs process consumes more than 6GB of RAM. Does this make sense, or am I missing something here?
Thanks for your help.
I would like to efficiently convert a couple of JPEG images contained in a tar.gz into an x264 mp4 movie.
gzip -cd Monitor-1-xx.tar.gz|cpio -i --to-stdout|jpegtopnm|ppmtoy4m -F 4:1| \
x264 --crf 24 -o Monitor-1-xx.mp4 --stdin y4m -
The problem here is that, after cpio I have multiple jpg files in a single stream and jpegtopnm only converts the first one.
I would like to find a way to split the stream (or to get it pre-split), and then run jpegtopnm once for each piece. It is somewhat like what xargs does when I untar to disk first. Writing to disk is something I am trying to eschew:
mkdir tmpMonitor && cd tmpMonitor && tar -xf ../Monitor-1-xx.tar.gz
find . -iname "*.jpg"|xargs -n1 jpegtopnm|ppmtoy4m -F 4:1| \
x264 --crf 24 -o ../xx.mp4 --stdin y4m -
cd .. && rm -rf tmpMonitor
Any suggestions?
tar has a couple of options that may be useful here (I have GNU tar, so apologies in advance for assuming you do in case you actually don't):
--wildcards - lets you pick files to extract from the tar using globs like *.jpeg
--to-command - pipe each extracted file to the given command.
So maybe something like this?
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command="jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx.mp4 --stdin y4m -"
Well, I don't know much about x264, so consider that untested code; I tested it using simple .txt files instead of .jpegs and cat -n instead of jpegtopnm, etc. The other thing is, I am guessing you want separate output files (one per jpeg), so ../xx.mp4 won't do. Assuming you want a separate invocation of jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx.mp4 --stdin y4m - for each file, you need a different output filename for -o, in which case the following hack might work (note the single quotes, so that date is expanded once per extracted file by the shell tar spawns, rather than once by the invoking shell):
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command='jpegtopnm|ppmtoy4m -F 4:1| x264 --crf 24 -o ../xx-$(date +%H%M%S%N).mp4 --stdin y4m -'
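As an alternative to timestamp-based names, GNU tar exports the name of the member being extracted to the --to-command process in the TAR_FILENAME environment variable, so an untested variant could name each output after its source file instead:
tar -xzf Monitor-1-xx.tar.gz --wildcards '*.jpeg' \
--to-command='jpegtopnm | ppmtoy4m -F 4:1 | x264 --crf 24 -o "../${TAR_FILENAME##*/}.mp4" --stdin y4m -'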