Efficient method to parse large number of files - shell

I've incoming data which will be in range of 130GBs - 300GBs containing 1000's (maybe millions) of small .txt files of size 2KB - 1MB in a SINGLE folder. I want to parse them efficiently.
I'm looking at the following options (Referred from - 21209029]:
Using printf + xargs (followed by egrep & awk text processing)
printf '%s\0' *.txt | xargs -0 cat | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' > all_in_1.out
Using find + cat (followed by egrep & awk text processing)
find . -name \*.txt -exec cat {} > all_in_1.tmp \;
cat all_in_1.tmp | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' > all_in_1.out
Using for loop
for file in *.txt
do
cat "$file" | egrep -i -v 'pattern1|...|pattern8' | awk '{gsub(/"\t",",")}1' >> all_in_1.out
done
Which one of the above is the most efficient? Is there a better way to do it?
Or is using shell commands not at all recommended to handle this amount of data processing (I do prefer a shell way for this)?
The server has RHEL 6.5 OS with 32 GB memory with 16 Cores (#2.2GHz).

Approach 1 and 3 expand the list of files on the shell command line. This will not work with a huge number of files. Approach 1 and 3 also do not work if the files are distributed across many directories (which is likely with millions of files).
Approach 2 makes a copy of all data, so it is inefficient as well.
You should use find and pass the file names directly to egrep. Use the -h option to suppress the prefix with the file name:
find . -name \*.txt -print0 \
| xargs -0 egrep -i -v -h 'pattern1|...|pattern8' \
| awk '{gsub(/"\t",",")}1' > all_in_1.out
xargs will automatically launch multiple egrep processes in sequence to avoid exceeding the command line limit in a single invocation.
Depending on the file contents, it may also be more efficient to avoid the egrep processes altogether, and do the filtering directly in awk:
find . -name \*.txt -print0 \
| xargs -0 awk 'BEGIN { IGNORECASE = 1 } ! /pattern1|...|pattern8/ {gsub(/"\t",",")}1' > all_in_1.out
BEGIN { IGNORECASE = 1 } corresponds to the -i option of egrep, and the ! inverts the sense of the matching, just like -v. IGNORECASE appears to be a GNU extension.

Related

Refining bash script with multiple find regex sed awk to array and functions that build a report

The following code is working, but it takes too long and everything I've tried to reduce it bombs either due to white spaces, inconsistent access.log syntax or something else.
Any suggestions to help cut down the finds to one find $LOGS -mtime -30 -type f - print0 and grep/sed/awk/sort once compared to multiple finds like this would be appreciated:
find $LOGS -mtime -30 -type f -print0 | xargs -0 grep -B 2 -w "RESULT err=0 tag=97" | grep -w "BIND" | sed '/uid=/!d;s//&\n/;s/.*\n//;:a;/,/bb;$!{n;ba};:b;s//\n&/;P;D' | sed 's/ //g' | sed s/$/,/g |awk '{a[$1]++}END{for(i in a)print i a[i]}' |sort -t , -k 2 -g > $OUTPUT1;
find $LOGS -mtime -30 -type f -print0 | xargs -0 grep -B 2 -w "RESULT err=0 tag=97" | grep -E 'BIND|LDAP connection from*' | sed '/from /!d;s//&\n/;s/.*\n//;:a;/:/bb;$!{n;ba};:b;s//\n&/;P;D' | sed 's/ //g' | sed s/$/,/g |awk '{a[$1]++}END{for(i in a)print i a[i]}' |sort -t , -k 2 -g > $IPAUTH0;
find $LOGS -mtime -30 -type f -print0 | xargs -0 grep -B 2 -w "RESULT err=49 tag=97" | grep -w "BIND" | sed '/uid=/!d;s//&\n/;s/.*\n//;:a;/,/bb;$!{n;ba};:b;s//\n&/;P;D' | sed 's/ //g' | sed s/$/,/g |awk '{a[$1]++}END{for(i in a)print i a[i]}' |sort -t , -k 2 -g > $OUTPUT2;
I've tried: for find | while read -r file; do grep1>output1 grep2>output2 grep3>output3 done and a few others, but cannot seem to get the syntax right and am hoping to cut down the repeats here.
The full script (stripped of some content) can be found here and runs against a Java program I wrote for an email report. NOTE: This runs against access logs in about 60GB of combined text.
I haven't looked closely at the sed/awk/etc section (and they'll be hard to work on without some example data), but you should be able to share the initial scans by greping for lines matching any of the patterns, storing that in a temp file, and then searching just that for the individual patterns. I'd also use find ... -exec instead of find ... | xargs:
tempfile=$(mktemp "${TMPDIR:-/tmp}/logextract.XXXXXX") || {
echo "Error creating temp file" >&2
exit 1
}
find $LOGS -mtime -30 -type f -exec grep -B 2 -Ew "RESULT err=(0|49) tag=97" {} + >"$tempfile"
grep -B 2 -w "RESULT err=0 tag=97" "$tempfile" | grep -w "BIND" | ...
grep -B 2 -w "RESULT err=0 tag=97" "$tempfile" | grep -E 'BIND|LDAP connection from*' | ...
grep -B 2 -w "RESULT err=49 tag=97" "$tempfile" | grep -w "BIND" | ...
rm "$tempfile"
BTW, you probably don't mean to search for LDAP connection from* -- the from* at the end means "fro" followed by 0 or more "m" characters.
A couple of general scripting recommendations: use lower- or mixed-case variables to avoid accidental conflicts with the various all-caps names that have special meanings. (Except when you want the special meaning, e.g. setting PATH.)
Also, putting double-quotes around variable references is generally a good idea to prevent unexpected word splitting and wildcard expansion... except that in some places your script depends on this, like setting LOGS="/log_dump/ldap/c*", and then counting on wildcard expansion happening when the variable is used. In these cases, it's usually better to use a bash array to store each item (e.g. filename) as a separate element:
logs=(/log_dump/ldap/c*) # Wildcard gets expanded when it's defined
...
find "${logs[#]}" -mtime ... # All that syntax gets all array elements in unmangled form
Note that this isn't really needed in cases like this where you know there aren't going to be any unexpected wildcards or spaces in the variable, but when you're dealing with unconstrained data this method is safer. (I work mostly on macOS, where spaces in filenames are just a fact of life, and I've learned the hard way to use script idioms that aren't confused by them.)

Using cat to combine files with numerical names that must be kept in order

I have a number of files (these are randomly generated each time) that have a number in the name – within the file, the number is repeated. Example:
file1_85.txt
file1_242.txt
file1_9.txt
I want to cat the contents of these files into one larger file, file_all.txt.
The code that I tried using is this:
for f in file1_*.txt; do (cat "${f}"; echo " ") >> file_all.txt; done
However, the contents of file_all.txt look like this:
file1_242.txt
file1_85.txt
file1_9.txt
When I really want it to look like this:
file1_9.txt
file1_85.txt
file1_242.txt
Which would happen if bash cat the files in numerical order.
I have tried this:
for f in file1_{1..99999}.txt; do (cat "${f}"; echo " ") >> file_all.txt; done
Which worked, however I got error messages "No such file or directory" when it passed through a number that did not have a matching file. Also, this is very time consuming. Is there a better way to carry out this task?
Assuming the files don't have any newlines in their names, and you have the GNU version of sort, this will work:
while read file; do
cat "$file"
echo
done < <(ls -1 file_*.txt | sort -V) > file_all.txt
If your sort doesn't support -V (as on e.g. OS X), you can take advantage of the filename consistency to do a straight numeric sort instead:
while read file; do
cat "$file"
echo
done < <(ls -1 file_*.txt | sort -t_ -n -k2,2) > file_all.txt
Finally, if your files contain newlines, you can still use sort, but you need to use the -z option in conjunction with other tools that terminate elements of a list with NUL bytes instead of newlines:
find . -depth 1 -name 'file_*' -print0 | sort -zV | xargs -0 -I{} bash -c 'cat {}; echo'
Replace the sort -zV with sort -z -t_ -n -k2,2 for an older version of GNU sort that lacks the -V option; a totally non-GNU sort probably won't have -z either, though.
For filenames potentially containing newlines:
$ find -name 'file1*' -print0 | sort -zV | xargs -0 cat
file1_9
file1_85
file1_242
or, if the -V option is not available,
$ find -name 'file1*' -print0 | sort -z -n -t '_' -k 2 | xargs -0 cat
file1_9
file1_85
file1_242
This uses null separated filenames; the -z option tells sort to expect (and produce) null separated filenames, and xargs -0 is for null separated input as well.
Your "brute force" approach would work if:
$ for f in file1_{1..99999}.txt; do [ -f "${f}" ] && cat "${f}" >> file_all.txt; done
The comparison: [ -f "${f}" ] check if the file exists before cat, avoiding the error message.

From UNIX shell, how to find all files containing a specific string, then print the 4th line of each file?

I want to find all files within the current directory that contain a given string, then print just the 4th line of each file.
grep --null -l "$yourstring" * | # List all the files containing your string
xargs -0 sed -n '4p;q' # Print the fourth line of said files.
Different editions of grep have slightly different incantations of --null, but it's usually there in some form. Read your manpage for details.
Update: I believe one of the null file list incantations of grep is a reasonable solution that will cover the vast majority of real-world use cases, but to be entirely portable, if your version of grep does not support any null output it is not perfectly safe to use it with xargs, so you must resort to find.
find . -maxdepth 1 -type f -exec grep -q "$yourstring" {} \; -exec sed -n '4p;q' {} +
Because find arguments can almost all be used as predicates, the -exec grep -q… part filters the files that are eventually fed to sed down to only those that contain the required string.
From other user:
grep -Frl string . | xargs -n 1 sed -n 4p
Give a try to the below GNU find command,
find . -maxdepth 1 -type f -exec grep -l 'yourstring' {} \; | xargs -I {} awk 'NR==4{print; exit}' {}
It finds all the files in the current directory which contains specific string, and prints the line number 4 present in each file.
This for loop should work:
while read -d '' -r file; do
echo -n "$file: "
sed '4q;d' "$file"
done < <(grep --null -l "some-text" *.txt)

Shell script string search where '#' is not in 1st character position

this string search was provided by Paul.R (much appreciated Paul):
** find dir -type f -print0 | xargs -0 grep -F -f strings.txt **
Note, I am using the above search argument to perform a recursive directory search for hard coded path names within shell scripts. However, due to the limitations of the Unix environment (TRU64) I am unable to use the GREP -r switch to perform my directory search. Hence the use of the solution provided above.
As an additional criteria, I would like to extend this search argument to exclude any text where the first leading character of the string being searched is "#" (comment symbol).
Would appreciate any feedback.
Thanks...Evan
I'm assuming you're just trying to limit the results in the output from the command you posted. If that's so, then how about
find dir -type f -print0 | xargs -0 grep -F -f strings.txt | grep -v '^#'
The final piped command will ignore all lines that match the regex ^# (begins with # char)
This solution will not work if the path and file of the files is contained in strings.txt, but it might work in your situation.
find dir -type f -print0 | xargs -0 grep -v '^#' | grep -F -f strings.txt

Delete all but the most recent X files in bash

Is there a simple way, in a pretty standard UNIX environment with bash, to run a command to delete all but the most recent X files from a directory?
To give a bit more of a concrete example, imagine some cron job writing out a file (say, a log file or a tar-ed up backup) to a directory every hour. I'd like a way to have another cron job running which would remove the oldest files in that directory until there are less than, say, 5.
And just to be clear, there's only one file present, it should never be deleted.
The problems with the existing answers:
inability to handle filenames with embedded spaces or newlines.
in the case of solutions that invoke rm directly on an unquoted command substitution (rm `...`), there's an added risk of unintended globbing.
inability to distinguish between files and directories (i.e., if directories happened to be among the 5 most recently modified filesystem items, you'd effectively retain fewer than 5 files, and applying rm to directories will fail).
wnoise's answer addresses these issues, but the solution is GNU-specific (and quite complex).
Here's a pragmatic, POSIX-compliant solution that comes with only one caveat: it cannot handle filenames with embedded newlines - but I don't consider that a real-world concern for most people.
For the record, here's the explanation for why it's generally not a good idea to parse ls output: http://mywiki.wooledge.org/ParsingLs
ls -tp | grep -v '/$' | tail -n +6 | xargs -I {} rm -- {}
Note: This command operates in the current directory; to target a directory explicitly, use a subshell ((...)) with cd:
(cd /path/to && ls -tp | grep -v '/$' | tail -n +6 | xargs -I {} rm -- {})
The same applies analogously to the commands below.
The above is inefficient, because xargs has to invoke rm separately for each filename.
However, your platform's specific xargs implementation may allow you to solve this problem:
A solution that works with GNU xargs is to use -d '\n', which makes xargs consider each input line a separate argument, yet passes as many arguments as will fit on a command line at once:
ls -tp | grep -v '/$' | tail -n +6 | xargs -d '\n' -r rm --
Note: Option -r (--no-run-if-empty) ensures that rm is not invoked if there's no input.
A solution that works with both GNU xargs and BSD xargs (including on macOS) - though technically still not POSIX-compliant - is to use -0 to handle NUL-separated input, after first translating newlines to NUL (0x0) chars., which also passes (typically) all filenames at once:
ls -tp | grep -v '/$' | tail -n +6 | tr '\n' '\0' | xargs -0 rm --
Explanation:
ls -tp prints the names of filesystem items sorted by how recently they were modified , in descending order (most recently modified items first) (-t), with directories printed with a trailing / to mark them as such (-p).
Note: It is the fact that ls -tp always outputs file / directory names only, not full paths, that necessitates the subshell approach mentioned above for targeting a directory other than the current one ((cd /path/to && ls -tp ...)).
grep -v '/$' then weeds out directories from the resulting listing, by omitting (-v) lines that have a trailing / (/$).
Caveat: Since a symlink that points to a directory is technically not itself a directory, such symlinks will not be excluded.
tail -n +6 skips the first 5 entries in the listing, in effect returning all but the 5 most recently modified files, if any.
Note that in order to exclude N files, N+1 must be passed to tail -n +.
xargs -I {} rm -- {} (and its variations) then invokes on rm on all these files; if there are no matches at all, xargs won't do anything.
xargs -I {} rm -- {} defines placeholder {} that represents each input line as a whole, so rm is then invoked once for each input line, but with filenames with embedded spaces handled correctly.
-- in all cases ensures that any filenames that happen to start with - aren't mistaken for options by rm.
A variation on the original problem, in case the matching files need to be processed individually or collected in a shell array:
# One by one, in a shell loop (POSIX-compliant):
ls -tp | grep -v '/$' | tail -n +6 | while IFS= read -r f; do echo "$f"; done
# One by one, but using a Bash process substitution (<(...),
# so that the variables inside the `while` loop remain in scope:
while IFS= read -r f; do echo "$f"; done < <(ls -tp | grep -v '/$' | tail -n +6)
# Collecting the matches in a Bash *array*:
IFS=$'\n' read -d '' -ra files < <(ls -tp | grep -v '/$' | tail -n +6)
printf '%s\n' "${files[#]}" # print array elements
Remove all but 5 (or whatever number) of the most recent files in a directory.
rm `ls -t | awk 'NR>5'`
(ls -t|head -n 5;ls)|sort|uniq -u|xargs rm
This version supports names with spaces:
(ls -t|head -n 5;ls)|sort|uniq -u|sed -e 's,.*,"&",g'|xargs rm
Simpler variant of thelsdj's answer:
ls -tr | head -n -5 | xargs --no-run-if-empty rm
ls -tr displays all the files, oldest first (-t newest first, -r reverse).
head -n -5 displays all but the 5 last lines (ie the 5 newest files).
xargs rm calls rm for each selected file.
find . -maxdepth 1 -type f -printf '%T# %p\0' | sort -r -z -n | awk 'BEGIN { RS="\0"; ORS="\0"; FS="" } NR > 5 { sub("^[0-9]*(.[0-9]*)? ", ""); print }' | xargs -0 rm -f
Requires GNU find for -printf, and GNU sort for -z, and GNU awk for "\0", and GNU xargs for -0, but handles files with embedded newlines or spaces.
All these answers fail when there are directories in the current directory. Here's something that works:
find . -maxdepth 1 -type f | xargs -x ls -t | awk 'NR>5' | xargs -L1 rm
This:
works when there are directories in the current directory
tries to remove each file even if the previous one couldn't be removed (due to permissions, etc.)
fails safe when the number of files in the current directory is excessive and xargs would normally screw you over (the -x)
doesn't cater for spaces in filenames (perhaps you're using the wrong OS?)
ls -tQ | tail -n+4 | xargs rm
List filenames by modification time, quoting each filename. Exclude first 3 (3 most recent). Remove remaining.
EDIT after helpful comment from mklement0 (thanks!): corrected -n+3 argument, and note this will not work as expected if filenames contain newlines and/or the directory contains subdirectories.
Ignoring newlines is ignoring security and good coding. wnoise had the only good answer. Here is a variation on his that puts the filenames in an array $x
while IFS= read -rd ''; do
x+=("${REPLY#* }");
done < <(find . -maxdepth 1 -printf '%T# %p\0' | sort -r -z -n )
For Linux (GNU tools), an efficient & robust way to keep the n newest files in the current directory while removing the rest:
n=5
find . -maxdepth 1 -type f -printf '%T# %p\0' |
sort -z -nrt ' ' -k1,1 |
sed -z -e "1,${n}d" -e 's/[^ ]* //' |
xargs -0r rm -f
For BSD, find doesn't have the -printf predicate, stat can't output NULL bytes, and sed + awk can't handle NULL-delimited records.
Here's a solution that doesn't support newlines in paths but that safeguards against them by filtering them out:
#!/bin/bash
n=5
find . -maxdepth 1 -type f ! -path $'*\n*' -exec stat -f '%.9Fm %N' {} + |
sort -nrt ' ' -k1,1 |
awk -v n="$n" -F'^[^ ]* ' 'NR > n {printf "%s%c", $2, 0}' |
xargs -0 rm -f
note: I'm using bash because of the $'\n' notation. For sh you can define a variable containing a literal newline and use it instead.
Solution for UNIX & Linux (inspired from AIX/HP-UX/SunOS/BSD/Linux ls -b):
Some platforms don't provide find -printf, nor stat, nor support NUL-delimited records with stat/sort/awk/sed/xargs. That's why using perl is probably the most portable way to tackle the problem, because it is available by default in almost every OS.
I could have written the whole thing in perl but I didn't. I only use it for substituting stat and for encoding-decoding-escaping the filenames. The core logic is the same as the previous solutions and is implemented with POSIX tools.
note: perl's default stat has a resolution of a second, but starting from perl-5.8.9 you can get sub-second resolution with the stat function of the module Time::HiRes (when both the OS and the filesystem support it). That's what I'm using here; if your perl doesn't provide it then you can remove the ‑MTime::HiRes=stat from the command line.
n=5
find . '(' -name '.' -o -prune ')' -type f -exec \
perl -MTime::HiRes=stat -le '
foreach (#ARGV) {
#st = stat($_);
if ( #st > 0 ) {
s/([\\\n])/sprintf( "\\%03o", ord($1) )/ge;
print sprintf( "%.9f %s", $st[9], $_ );
}
else { print STDERR "stat: $_: $!"; }
}
' {} + |
sort -nrt ' ' -k1,1 |
sed -e "1,${n}d" -e 's/[^ ]* //' |
perl -l -ne '
s/\\([0-7]{3})/chr(oct($1))/ge;
s/(["\n])/"\\$1"/g;
print "\"$_\"";
' |
xargs -E '' sh -c '[ "$#" -gt 0 ] && rm -f "$#"' sh
Explanations:
For each file found, the first perl gets the modification time and outputs it along the encoded filename (each newline and backslash characters are replaced with the literals \012 and \134 respectively).
Now each time filename is guaranteed to be single-line, so POSIX sort and sed can safely work with this stream.
The second perl decodes the filenames and escapes them for POSIX xargs.
Lastly, xargs calls rm for deleting the files. The sh command is a trick that prevents xargs from running rm when there's no files to delete.
I realize this is an old thread, but maybe someone will benefit from this. This command will find files in the current directory :
for F in $(find . -maxdepth 1 -type f -name "*_srv_logs_*.tar.gz" -printf '%T# %p\n' | sort -r -z -n | tail -n+5 | awk '{ print $2; }'); do rm $F; done
This is a little more robust than some of the previous answers as it allows to limit your search domain to files matching expressions. First, find files matching whatever conditions you want. Print those files with the timestamps next to them.
find . -maxdepth 1 -type f -name "*_srv_logs_*.tar.gz" -printf '%T# %p\n'
Next, sort them by the timestamps:
sort -r -z -n
Then, knock off the 4 most recent files from the list:
tail -n+5
Grab the 2nd column (the filename, not the timestamp):
awk '{ print $2; }'
And then wrap that whole thing up into a for statement:
for F in $(); do rm $F; done
This may be a more verbose command, but I had much better luck being able to target conditional files and execute more complex commands against them.
If the filenames don't have spaces, this will work:
ls -C1 -t| awk 'NR>5'|xargs rm
If the filenames do have spaces, something like
ls -C1 -t | awk 'NR>5' | sed -e "s/^/rm '/" -e "s/$/'/" | sh
Basic logic:
get a listing of the files in time order, one column
get all but the first 5 (n=5 for this example)
first version: send those to rm
second version: gen a script that will remove them properly
With zsh
Assuming you don't care about present directories and you will not have more than 999 files (choose a bigger number if you want, or create a while loop).
[ 6 -le `ls *(.)|wc -l` ] && rm *(.om[6,999])
In *(.om[6,999]), the . means files, the o means sort order up, the m means by date of modification (put a for access time or c for inode change), the [6,999] chooses a range of file, so doesn't rm the 5 first.
Adaptation of #mklement0's excellent answer with some parameters and without needing to navigate to the folder containing the files to be deleted...
TARGET_FOLDER="/my/folder/path"
FILES_KEEP=5
ls -tp "$TARGET_FOLDER"**/* | grep -v '/$' | tail -n +$((FILES_KEEP+1)) | xargs -d '\n' -r rm --
[Ref(s).: https://stackoverflow.com/a/3572628/3223785 ]
Thanks! 😉
found interesting cmd in Sed-Onliners - Delete last 3 lines - fnd it perfect for another way to skin the cat (okay not) but idea:
#!/bin/bash
# sed cmd chng #2 to value file wish to retain
cd /opt/depot
ls -1 MyMintFiles*.zip > BigList
sed -n -e :a -e '1,2!{P;N;D;};N;ba' BigList > DeList
for i in `cat DeList`
do
echo "Deleted $i"
rm -f $i
#echo "File(s) gonzo "
#read junk
done
exit 0
Removes all but the 10 latest (most recents) files
ls -t1 | head -n $(echo $(ls -1 | wc -l) - 10 | bc) | xargs rm
If less than 10 files no file is removed and you will have :
error head: illegal line count -- 0
To count files with bash
I needed an elegant solution for the busybox (router), all xargs or array solutions were useless to me - no such command available there. find and mtime is not the proper answer as we are talking about 10 items and not necessarily 10 days. Espo's answer was the shortest and cleanest and likely the most unversal one.
Error with spaces and when no files are to be deleted are both simply solved the standard way:
rm "$(ls -td *.tar | awk 'NR>7')" 2>&-
Bit more educational version: We can do it all if we use awk differently. Normally, I use this method to pass (return) variables from the awk to the sh. As we read all the time that can not be done, I beg to differ: here is the method.
Example for .tar files with no problem regarding the spaces in the filename. To test, replace "rm" with the "ls".
eval $(ls -td *.tar | awk 'NR>7 { print "rm \"" $0 "\""}')
Explanation:
ls -td *.tar lists all .tar files sorted by the time. To apply to all the files in the current folder, remove the "d *.tar" part
awk 'NR>7... skips the first 7 lines
print "rm \"" $0 "\"" constructs a line: rm "file name"
eval executes it
Since we are using rm, I would not use the above command in a script! Wiser usage is:
(cd /FolderToDeleteWithin && eval $(ls -td *.tar | awk 'NR>7 { print "rm \"" $0 "\""}'))
In the case of using ls -t command will not do any harm on such silly examples as: touch 'foo " bar' and touch 'hello * world'. Not that we ever create files with such names in real life!
Sidenote. If we wanted to pass a variable to the sh this way, we would simply modify the print (simple form, no spaces tolerated):
print "VarName="$1
to set the variable VarName to the value of $1. Multiple variables can be created in one go. This VarName becomes a normal sh variable and can be normally used in a script or shell afterwards. So, to create variables with awk and give them back to the shell:
eval $(ls -td *.tar | awk 'NR>7 { print "VarName=\""$1"\"" }'); echo "$VarName"
leaveCount=5
fileCount=$(ls -1 *.log | wc -l)
tailCount=$((fileCount - leaveCount))
# avoid negative tail argument
[[ $tailCount < 0 ]] && tailCount=0
ls -t *.log | tail -$tailCount | xargs rm -f
I made this into a bash shell script. Usage: keep NUM DIR where NUM is the number of files to keep and DIR is the directory to scrub.
#!/bin/bash
# Keep last N files by date.
# Usage: keep NUMBER DIRECTORY
echo ""
if [ $# -lt 2 ]; then
echo "Usage: $0 NUMFILES DIR"
echo "Keep last N newest files."
exit 1
fi
if [ ! -e $2 ]; then
echo "ERROR: directory '$1' does not exist"
exit 1
fi
if [ ! -d $2 ]; then
echo "ERROR: '$1' is not a directory"
exit 1
fi
pushd $2 > /dev/null
ls -tp | grep -v '/' | tail -n +"$1" | xargs -I {} rm -- {}
popd > /dev/null
echo "Done. Kept $1 most recent files in $2."
ls $2|wc -l
Modified version of the answer of #Fabien if you want to specify a path. Useful if you're running the script elsewhere.
ls -tr /path/foo/ | head -n -5 | xargs -I% --no-run-if-empty rm /path/foo/%

Resources