Memory problems with xargs and grep and pattern from file - bash

Within a makefile I run the following command
find SOURCE_DIR -name '*.gz' | xargs -P4 -L1 bash -c 'zcat $$1 | grep -F -f <(zcat patternfile.csv.gz) | gzip > TARGET_DIR/$${1##*/}' -
patternfile.csv.gz contains 2M entries with an unzipped file size of 100MB, each file in SOURCE_DIR has a zipped file size of ~20MB.
However, each xargs process consumes more than 6GB of RAM. Does this make sense, or am I missing something here?
Thanks for your help.

Related

Efficient method to parse large number of files

I have incoming data in the range of 130GB to 300GB, consisting of thousands (maybe millions) of small .txt files of 2KB to 1MB each, all in a SINGLE folder. I want to parse them efficiently.
I'm looking at the following options (referred from 21209029):
Using printf + xargs (followed by egrep & awk text processing)
printf '%s\0' *.txt | xargs -0 cat | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' > all_in_1.out
Using find + cat (followed by egrep & awk text processing)
find . -name \*.txt -exec cat {} > all_in_1.tmp \;
cat all_in_1.tmp | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' > all_in_1.out
Using for loop
for file in *.txt
do
cat "$file" | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' >> all_in_1.out
done
Which one of the above is the most efficient? Is there a better way to do it?
Or is using shell commands not at all recommended to handle this amount of data processing (I do prefer a shell way for this)?
The server runs RHEL 6.5 with 32 GB of memory and 16 cores (@ 2.2GHz).
Approaches 1 and 3 expand the list of files on the shell command line. This will not work with a huge number of files. Approaches 1 and 3 also do not work if the files are distributed across many directories (which is likely with millions of files).
Approach 2 makes a copy of all data, so it is inefficient as well.
You should use find and pass the file names directly to egrep. Use the -h option to suppress the prefix with the file name:
find . -name \*.txt -print0 \
| xargs -0 egrep -i -v -h 'pattern1|...|pattern8' \
| awk '{gsub(/"\t",",")}1' > all_in_1.out
xargs will automatically launch multiple egrep processes in sequence to avoid exceeding the command line limit in a single invocation.
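The batching is easy to see on a small scale. The sketch below uses -n 3 purely for illustration, to cap each invocation at three arguments; without -n, xargs fills each command line up to the system limit instead.

```shell
# Ten arguments, at most three per invocation: xargs runs echo four times,
# one after the other, and each run prints one line.
batches=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | xargs -n 3 echo)
printf '%s\n' "$batches"
# 1 2 3
# 4 5 6
# 7 8 9
# 10
```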
Depending on the file contents, it may also be more efficient to avoid the egrep processes altogether, and do the filtering directly in awk:
find . -name \*.txt -print0 \
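| xargs -0 awk 'BEGIN { IGNORECASE = 1 } ! /pattern1|...|pattern8/ {gsub(/"\t",","); print}' > all_in_1.out
BEGIN { IGNORECASE = 1 } corresponds to the -i option of egrep, and the ! inverts the sense of the matching, just like -v. The print sits inside the action so that only non-matching lines are output; a trailing 1 pattern would print every line, matched or not. IGNORECASE is a GNU awk (gawk) extension.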
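If gawk is not available, a possible stand-in is to lowercase each line with tolower() before matching. This is only a sketch: filter_lines is a made-up helper name, "error" and "debug" are placeholder patterns, and the tab-to-comma substitution is my assumption about what the gsub in the question is meant to do.

```shell
# Case-insensitive exclusion without IGNORECASE: compare a lowercased copy
# of the line against lowercase patterns, then print only the survivors.
# "error|debug" are placeholders; the tab-to-comma gsub is an assumption.
filter_lines() {
    awk 'tolower($0) !~ /error|debug/ { gsub(/\t/, ","); print }'
}

printf 'keep\tme\nERROR here\nDebug line\nalso\tkept\n' | filter_lines
# keep,me
# also,kept
```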

Find last created tar.gz and extract it

I need to find last created tar.gz file and extract it to some directory, something like this:
ls -t $(pwd)/Backup_db/ | head -1 | xargs tar xf -C /somedirectory
How to do it the right way in CentOS 7?
You can find the most recently modified file with a command substitution, and then use that in place of a filename. Create the new directory first, then extract the tar file into it. Quote the variable and the substitution so that paths containing spaces survive.
new_dir="path/to/new/dir"
mkdir -p "$new_dir"
tar -zxvf "$(ls -t *.tar.gz | head -1)" -C "$new_dir"
Note that ls -t <dir> will not show the full <dir>/<filename> path for the files, but ls -t <dir>/* will, so after also reordering xargs flags (and forcing -n1 for safety), below should work for you:
ls -t $(pwd)/Backup_db/*.tar.gz | head -1 | xargs -n1 tar -C /somedirectory -xf
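Both answers parse the output of ls, which can go wrong with unusual file names. A rough alternative, assuming your shell's test supports -nt (bash, ksh and dash do), is to compare modification times in a loop. newest_archive is a hypothetical helper name, and like ls -t it goes by modification time rather than true creation time.

```shell
# Print the most recently modified *.tar.gz in the directory given as $1.
# newest_archive is a hypothetical helper name, not a standard command.
newest_archive() {
    dir=$1
    newest=
    for f in "$dir"/*.tar.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage with the paths from the question:
# tar -xf "$(newest_archive "$PWD/Backup_db")" -C /somedirectory
```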

Delete corrupt gz archives with "xargs rm"

I would like to pre-process a directory of .gz files before submitting them to Hadoop/Spark. This is to avoid issues, such as these ones. The following bash pipeline almost does what I need, except that xargs rm doesn't seem to delete the files that fail the gunzip -t test.
gunzip -t *.gz 2>&1 | cut -f 2 -d: - | xargs rm
The pipeline works silently. Yet when gunzip -t *.gz is called again, it prints out
gzip: unhappy.gz: unexpected end of file
or similar.
For some reason, it looks as though this only deletes one file, then finishes. A (more complicated) pipeline that invokes xargs twice seems to work much more reliably:
ls *.gz | xargs -n 1 gunzip -t 2>&1 | cut -f 2 -d: - | xargs -t -n 1 rm
Decomposed, this pipeline says:
ls *.gz: list all .gz files
xargs -n 1 gunzip -t 2>&1: send that list one at a time (-n 1) to gunzip -t, to test the input
cut -f 2 -d: -: extract the filename from the output of gunzip, which is the second field (-f 2) of the line delimited by : character
xargs -t -n 1 rm: send the output of cut to rm one filename at a time, printing out progress (-t) as it operates
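The same one-file-at-a-time idea can be written as a plain loop that acts on the exit status of gzip -t instead of parsing its error messages. A minimal sketch, where remove_corrupt_gz is a made-up function name and the directory is passed as an argument:

```shell
# Test every .gz file in the directory given as $1 and delete the ones
# that fail the integrity check. remove_corrupt_gz is a hypothetical name.
remove_corrupt_gz() {
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if ! gzip -t "$f" 2>/dev/null; then
            echo "removing corrupt archive: $f"
            rm -- "$f"
        fi
    done
}

# Usage: remove_corrupt_gz .
```

Because the file name never passes through cut or xargs, names containing colons or spaces are handled as well.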

Scaling up grep find and copy to large folder (xargs?)

I would like to search a directory for any file that matches any of a list of words. If a file matches, I would like to copy that file into a new directory. I created a small batch of test files and got the following code working:
cp `grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation'` '/Users/newlocation'
Unfortunately, when I run this code on a large folder with a few thousand files it says the argument list is too long for cp. I think I need to loop this or use a xargs but I can't figure out how to make the conversion.
The minimal change from what you have would be:
grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs cp -t '/Users/newlocation'
But don't use that. Because you never know when you will encounter a filename with spaces or newlines in it, null-terminated strings should be used. On Linux/GNU, add the -Z option to grep and -0 to xargs:
grep -Zlir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 cp -t '/Users/newlocation'
On Macs (and AIX, HP-UX, Solaris, *BSD), the grep options change slightly but, more importantly, the GNU cp -t option is not available. A workaround is:
grep -lir --null 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 -I fname cp fname '/Users/newlocation'
This is less efficient because a new instance of cp has to be run for each file to be copied.
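One way to get batching back without cp -t is to let xargs hand each batch to a small sh -c wrapper, which places the destination last. This is a sketch under assumptions: copy_matches is a hypothetical helper name, and the pattern and directories are passed as parameters.

```shell
# Copy every file under $2 that matches pattern $1 into directory $3.
# copy_matches is a hypothetical helper name.
copy_matches() {
    # sh -c receives the destination as $0 and a batch of file names as "$@",
    # so one cp call copies many files. Note: GNU xargs runs the command once
    # even when grep finds nothing, in which case cp reports a usage error.
    grep -lir --null "$1" "$2" |
        xargs -0 sh -c 'cp -- "$@" "$0"' "$3"
}

# Usage with the paths from the question:
# copy_matches 'word\|word2\|word3\|word4\|word5' /Users/originallocation /Users/newlocation
```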
An alternative for those without grep -r is find + egrep + xargs. It assumes that no two files in different folders share the same name, since the copies would otherwise overwrite each other. I also replaced the ugly word\|word2\|word3\|word4\|word5 style with an extended regular expression:
find . -type f -exec egrep -l 'word|word2|word3|word4|word5' {} \; |xargs -i cp {} /LARGE_FOLDER

How to ignore file types from being archived in my shell script?

I have a shell script that runs on AIX 7.1 and the purpose of it is to archive a bunch of different directories using CPIO.
We pass in the directories to be archived to CPIO from a flat file called synclive_cpio.list.
Here is a snippet of the script..
#!/bin/ksh
CPIO_LIST="$BASE/synclive_cpio.list"
DUMP_DIR=/usr4/sync_stage
LOG_FILE=/tmp/synclive.log
run_cpio()
{
while LINE=: read -r f1 f2
do
sleep 1
cd $f1
echo "Sending CPIO processing for $f1 to background." >> $LOG_FILE
time find . -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
done <"$CPIO_LIST"
wait
}
Here is what we have in the synclive_cpio.list file...
/usr2/devel_config usr2_devel_config
/usr/local usr_local
/usr1/config usr1_config
/usr1/releases usr1_releases
When the CPIO is running, it will archive everything in the passed directory.. what I would like to do is try to exclude a few file extension types such as *.log and *.tmp as we don't need to archive those.
Any idea how to change run_cpio() block to support this?
Thanks.
Exclude them in the find command
time find . ! \( -name '*.log' -o -name '*.tmp' \) -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
You can keep adding -o -name '*.suffix' to the list.
Another method is to use egrep -v in the pipeline. I believe that would be a teeny tiny bit more efficient.
time find . -print | egrep -v '\.(tmp|log)$' | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
To me, it is also easier to read and modify if the list gets really long.
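To check that the two exclusion styles select the same files, you can compare them on a scratch directory. A small sketch, with made-up file names and grep -E used as the portable spelling of egrep:

```shell
# Build a throwaway directory with one file to keep and two to exclude.
d=$(mktemp -d)
touch "$d/a.txt" "$d/b.log" "$d/c.tmp"

# Variant 1: exclude inside find. Variant 2: filter find's output with grep.
by_find=$(cd "$d" && find . ! \( -name '*.log' -o -name '*.tmp' \) -print | sort)
by_grep=$(cd "$d" && find . -print | grep -E -v '\.(tmp|log)$' | sort)

# Both variables now hold the same list: "." and "./a.txt".
printf '%s\n' "$by_find"
rm -rf "$d"
```

The grep variant works line by line, so it would misbehave on file names that contain newlines; the find variant does not have that problem.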
