Within a makefile I run the following command:
find SOURCE_DIR -name '*.gz' | xargs -P4 -L1 bash -c 'zcat $$1 | grep -F -f <(zcat patternfile.csv.gz) | gzip > TARGET_DIR/$${1##*/}' -
patternfile.csv.gz contains 2M entries with an unzipped size of 100 MB; each file in SOURCE_DIR has a zipped size of ~20 MB.
However, each process spawned by xargs consumes more than 6 GB of RAM. Does this make sense, or am I missing something here?
Thanks for your help.
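For reference, here is a rough sketch of how grep's memory use could be measured in isolation, outside of make and xargs (this assumes GNU time is installed at /usr/bin/time; sample.gz and /tmp/patterns.csv are placeholder names):
# decompress the pattern list once so only grep's own footprint is measured
zcat patternfile.csv.gz > /tmp/patterns.csv
# GNU time -v reports "Maximum resident set size" for the grep process
zcat SOURCE_DIR/sample.gz | /usr/bin/time -v grep -F -f /tmp/patterns.csv > /dev/null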
I have incoming data in the range of 130 GB to 300 GB, consisting of thousands (maybe millions) of small .txt files of 2 KB to 1 MB each, all in a SINGLE folder. I want to parse them efficiently.
I'm looking at the following options (referred from 21209029):
Using printf + xargs (followed by egrep & awk text processing)
printf '%s\0' *.txt | xargs -0 cat | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t"/,",")}1' > all_in_1.out
Using find + cat (followed by egrep & awk text processing)
find . -name \*.txt -exec cat {} > all_in_1.tmp \;
cat all_in_1.tmp | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t"/,",")}1' > all_in_1.out
Using for loop
for file in *.txt
do
cat "$file" | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' >> all_in_1.out
done
Which one of the above is the most efficient? Is there a better way to do it?
Or is using shell commands not recommended at all for this amount of data processing (I would prefer a shell-based approach)?
The server runs RHEL 6.5 with 32 GB of memory and 16 cores (2.2 GHz).
Approaches 1 and 3 expand the list of files on the shell command line. This will not work with a huge number of files. Approaches 1 and 3 also do not work if the files are distributed across many directories (which is likely with millions of files).
Approach 2 makes a copy of all data, so it is inefficient as well.
You should use find and pass the file names directly to egrep. Use the -h option to suppress prefixing each match with its file name:
find . -name \*.txt -print0 \
| xargs -0 egrep -i -v -h 'pattern1|...|pattern8' \
| awk '{gsub(/"\t"/,",")}1' > all_in_1.out
xargs will automatically launch multiple egrep processes in sequence to avoid exceeding the command line limit in a single invocation.
Depending on the file contents, it may also be more efficient to avoid the egrep processes altogether, and do the filtering directly in awk:
find . -name \*.txt -print0 \
| xargs -0 awk 'BEGIN { IGNORECASE = 1 } ! /pattern1|...|pattern8/ { gsub(/"\t"/,","); print }' > all_in_1.out
BEGIN { IGNORECASE = 1 } corresponds to the -i option of egrep, and the ! inverts the sense of the matching, just like -v. IGNORECASE appears to be a GNU extension.
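Since the server in the question has 16 cores, GNU xargs can also run several of those egrep processes in parallel with -P. A sketch (assuming GNU grep and GNU xargs; -P 8 is an arbitrary degree of parallelism, and --line-buffered keeps concurrently written lines from interleaving mid-line):
find . -name \*.txt -print0 \
| xargs -0 -P 8 egrep -i -v -h --line-buffered 'pattern1|...|pattern8' \
| awk '{gsub(/"\t"/,",")}1' > all_in_1.out
Note that the order of the output lines is then no longer deterministic.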
I need to find the most recently created tar.gz file and extract it to some directory, something like this:
ls -t $(pwd)/Backup_db/ | head -1 | xargs tar xf -C /somedirectory
How to do it the right way in CentOS 7?
You can find the most recently modified file in a subshell (command substitution) and then use that in place of a filename. Create the new directory first, then extract the tar file into it.
new_dir="path/to/new/dir"
mkdir -p "$new_dir"
tar -zxvf "$(ls -t *.tar.gz | head -1)" -C "$new_dir"
Note that ls -t <dir> will not show the full <dir>/<filename> path for the files, but ls -t <dir>/* will. After also reordering the tar options so that the filename appended by xargs lands right after -xf (and adding -n1 for safety), the following should work for you:
ls -t $(pwd)/Backup_db/*.tar.gz | head -1 | xargs -n1 tar -C /somedirectory -xf
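Bear in mind that ls -t ... | head -1 breaks on file names containing whitespace. On CentOS 7 (GNU findutils), a more robust sketch is to let find print modification times and sort on those; the Backup_db path and target directory are taken from the question, and file names are still assumed not to contain newlines:
# newest *.tar.gz in Backup_db, by modification time
newest=$(find "$PWD/Backup_db" -maxdepth 1 -name '*.tar.gz' -printf '%T@ %p\n' \
    | sort -rn | head -1 | cut -d' ' -f2-)
tar -xzf "$newest" -C /somedirectory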
I would like to pre-process a directory of .gz files before submitting them to Hadoop/Spark. This is to avoid issues such as these. The following bash pipeline almost does what I need, except that xargs rm doesn't seem to delete the files that fail the gunzip -t test.
gunzip -t *.gz 2>&1 | cut -f 2 -d: - | xargs rm
The pipeline works silently. Yet when gunzip -t *.gz is called again, it prints out
gzip: unhappy.gz: unexpected end of file
or similar.
For some reason, it looks as though this only deletes one file, then finishes. A (more complicated) pipeline that invokes xargs twice seems to work much more reliably:
ls *.gz | xargs -n 1 gunzip -t 2>&1 | cut -f 2 -d: - | xargs -t -n 1 rm
Decomposed, this pipeline says:
ls *.gz: list all .gz files
xargs -n 1 gunzip -t 2>&1: send that list one at a time (-n 1) to gunzip -t, to test the input
cut -f 2 -d: -: extract the filename from the output of gunzip, which is the second field (-f 2) of the line delimited by the : character
xargs -t -n 1 rm: send the output of cut to rm one filename at a time, printing out progress (-t) as it operates
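If file names can contain spaces or : characters, parsing gzip's error messages becomes fragile. A per-file loop with the same effect avoids that entirely (a sketch; rm -v prints each deletion, much like xargs -t):
# test each archive and delete it if gzip reports corruption
for f in *.gz; do
    gzip -t -- "$f" 2>/dev/null || rm -v -- "$f"
done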
I would like to search a directory for any file that matches any of a list of words. If a file matches, I would like to copy that file into a new directory. I created a small batch of test files and got the following code working:
cp `grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation'` '/Users/newlocation'
Unfortunately, when I run this code on a large folder with a few thousand files, it says the argument list is too long for cp. I think I need to loop this or use xargs, but I can't figure out how to make the conversion.
The minimal change from what you have would be:
grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs cp -t '/Users/newlocation'
But don't use that: because you never know when you will encounter a filename with spaces or newlines in it, null-terminated names should be used. On Linux/GNU, add the -Z option to grep and -0 to xargs:
grep -Zlir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 cp -t '/Users/newlocation'
On Macs (and AIX, HP-UX, Solaris, *BSD), the grep options change slightly but, more importantly, the GNU cp -t option is not available. A workaround is:
grep -lir --null 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 -I fname cp fname '/Users/newlocation'
This is less efficient because a new instance of cp has to be run for each file to be copied.
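If that per-file overhead matters, BSD/macOS xargs has a -J flag that substitutes the whole batch of names at a chosen position in the command, which restores the batching that cp -t provides on GNU. A sketch (assuming BSD xargs; % is just the chosen placeholder string):
grep -lir --null 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 -J % cp % '/Users/newlocation'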
An alternative solution for systems without grep -r, using find + egrep + xargs. Beware that files with the same name in different folders will overwrite each other in the target directory. It also replaces the ugly word\|word2\|word3\|word4\|word5 style with egrep's plain alternation:
find . -type f -exec egrep -l 'word|word2|word3|word4|word5' {} + | xargs -I {} cp {} /LARGE_FOLDER
I have a shell script that runs on AIX 7.1 and the purpose of it is to archive a bunch of different directories using CPIO.
We pass in the directories to be archived to CPIO from a flat file called synclive_cpio.list.
Here is a snippet of the script..
#!/bin/ksh
CPIO_LIST="$BASE/synclive_cpio.list"
DUMP_DIR=/usr4/sync_stage
LOG_FILE=/tmp/synclive.log
run_cpio()
{
while LINE=: read -r f1 f2
do
sleep 1
cd $f1
echo "Sending CPIO processing for $f1 to background." >> $LOG_FILE
time find . -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
done <"$CPIO_LIST"
wait
}
Here is what we have in the synclive_cpio.list file...
/usr2/devel_config usr2_devel_config
/usr/local usr_local
/usr1/config usr1_config
/usr1/releases usr1_releases
When the cpio runs, it archives everything in the given directory. What I would like to do is exclude a few file extensions, such as *.log and *.tmp, as we don't need to archive those.
Any idea how to change run_cpio() block to support this?
Thanks.
Exclude them in the find command:
time find . ! \( -name '*.log' -o -name '*.tmp' \) -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
You can keep adding -o -name '*.suffix' terms inside the parentheses.
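For example, to also skip a hypothetical *.bak suffix:
# '*.bak' is just an illustrative extra suffix
time find . ! \( -name '*.log' -o -name '*.tmp' -o -name '*.bak' \) -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &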
Another method is to use egrep -v in the pipeline. I believe that would be a teeny tiny bit more efficient.
time find . -print | egrep -v '\.(tmp|log)$' | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz &
To me, it is also easier to read and modify if the list gets really long.
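Putting the find-based exclusion back into the question's function, run_cpio() might look like this (a sketch based on the script above, with quoting added and the no-op LINE=: prefix dropped):
run_cpio()
{
    while read -r f1 f2
    do
        sleep 1
        cd "$f1" || continue
        echo "Sending CPIO processing for $f1 to background." >> $LOG_FILE
        time find . ! \( -name '*.log' -o -name '*.tmp' \) -print | cpio -o | gzip > "$DUMP_DIR/$f2.cpio.gz" &
    done < "$CPIO_LIST"
    wait
}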