I have recently run into a situation where I need to trim some rather large log files once they grow beyond a certain size. Everything but the last 1000 lines in each file is disposed of, and the job is run every half hour by cron. My solution was simply to run through the list of files, check the size, and trim if necessary.
for my $file (@fileList) {
    if ( ((-s $file) / (1024 * 1024)) > $CSize ) {
        open FH, '<', $file or die "Cannot open ${file}: $!\n";
        my $lineNo = 0;
        my @tLines;
        while (<FH>) {
            push @tLines, $_;
            shift @tLines if ++$lineNo > $CLLimit;   # keep only the last $CLLimit lines
        }
        close FH;

        open FH, '>', $file or die "Cannot write to ${file}: $!\n";
        print FH @tLines;
        close FH;
    }
}
This works in the current form but there is a lot of overhead for large log files (especially the ones with 100_000+ lines) because of the need to read in each line and shift if necessary.
Is there any way I could read in just a portion of the file? In this instance I only need access to the last "CLLimit" lines. Since the script is being deployed on a system that has seen better days (think Celeron 700 MHz with 64 MB of RAM), I am looking for a quicker alternative using Perl.
I realize you want to use Perl, but if this is a UNIX system, why not use the "tail" utility to do the trimming? You could do this in Bash with a very simple script:
if [ `stat -f "%z" "$file"` -gt "$MAX_FILE_SIZE" ]; then
    # stat -f "%z" is the BSD syntax; on GNU/Linux use: stat -c "%s" "$file"
    tail -n 1000 "$file" > "$file.tmp"
    # copy and then rm to avoid inode problems
    cp "$file.tmp" "$file"
    rm "$file.tmp"
fi
That being said, you would probably find this post very helpful if you're set on using Perl for this.
Estimate the average length of a line in the log - call it N bytes.
Seek backwards from the end of the file by 1000 * 1.10 * N (the factor 1.10 gives a 10% margin of error). Read forward from there, keeping just the most recent 1000 lines.
The question was asked: which function or module? The built-in function seek looks to me like the tool to use.
Consider simply using the logrotate utility; it is included in most modern Linux distributions. A related tool for BSD systems is called newsyslog. These tools are designed more or less for your intended purpose: they atomically move a log file out of place, create a new file (with the same name as before) to hold new log entries, instruct the program generating messages to use the new file, and then (optionally) compress the old file. You can configure how many rotated logs to keep. Here's a tutorial:
http://www.debian-administration.org/articles/117
It is not precisely the interface you want (keeping a certain number of lines), but the program will likely be more robust than anything you cook up on your own; for example, the answers here do not deal with atomically moving the file and notifying the logging program to use a new file, so there is a risk that some log messages will be lost.
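As a rough illustration, a logrotate stanza for a log of this kind might look something like the following; the log path, size threshold, and postrotate command are placeholders, and note that logrotate works on file size rather than on a line count:

/var/log/myapp.log {
    # rotate once the file grows beyond 1 MB, and keep four old copies
    size 1M
    rotate 4
    compress
    missingok
    postrotate
        # placeholder: tell the logging program to reopen its log file
        kill -HUP `cat /var/run/myapp.pid`
    endscript
}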
Related
I'm trying to set up a script that will create dummy .txt files of 24 MB in the /tmp/ directory. The idea behind this script is that Zabbix, a monitoring service, will notice that the directory is full and wipe it completely using a recovery expression.
However, I'm new to Linux and seem to be stuck on the script that generates the files. This is what I've currently written out.
today="$( date +"%Y%m%d" )"
number=0

while test -e "$today$suffix.txt"; do
    (( ++number ))
    suffix="$( printf -- %02d "$number" )"
done

fname="$today$suffix.txt"
printf 'Will use "%s" as filename\n' "$fname"
printf -c 24m /tmp/testf > "$fname"
I think what I'm doing wrong has to do with the printf command, but any input, advice, and/or pointers to a guide on scripting are very welcome.
Many thanks,
Melanchole
I guess that it doesn't matter what bytes are actually in that file, as long as it fills up the temp dir. For that reason, the right tool to create the file is dd, which is available in every Linux distribution, often installed by default.
Check the manpage for the other options, but the most important ones are the following (a complete command putting them together is sketched after the list):
if: the input file, /dev/zero probably which is just an endless stream of bytes with value zero
of: the output file, you can keep the code you have to generate it
count: number of blocks to copy, just use 24 here
bs: size of each block, use 1MB for that
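Reusing the $fname variable built in the question's script, the complete command would presumably be:

dd if=/dev/zero of="$fname" bs=1M count=24

That writes 24 blocks of 1 MB each, giving a 24 MB file of zero bytes.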
I had to write a Bash script to delete duplicate files today, using their md5 hashes. I stored those hashes as files in a temporary directory:
for i in * ; do
    hash=$(md5sum "$i" | cut -d " " -f1) ;
    if [ -f "/tmp/hashes/$hash" ] ;
    then
        echo "Deleted $i" ;
        mv "$i" /tmp/deleted ;
    else
        touch "/tmp/hashes/$hash" ;
    fi ;
done
It worked perfectly, but led me to wonder: is it a time-efficient way of doing that? I initially thought of storing the MD5 hashes in a file, but then I thought "no, because checking whether a given MD5 is in this file requires re-reading it entirely every time". Now I wonder: is it the same when using the "create files in a directory" method? Does the Bash [ -f ] check have linear or quasi-constant complexity when there are lots of files in the same directory?
If it depends on the filesystem, what's the complexity on tmpfs?
I'm a fan of using the right tool for the job. In this case, you only want to see duplicate files. I've tested this against several thousand files at my disposal and rereading the file did not seem to have any problems. Plus I noticed that I have hundreds of duplicate files. When I store hashes in separate files and then process this large quantity of files, my system slowly creeps along after about 10,000 hash files in one directory. Having all of the hashes in a single file greatly sped this up.
# This uses md5deep. An alternate is presented later.
md5deep -r some_folder > hashes.txt
# If you do not have md5deep
find . -type f -exec md5sum {} \; > hashes.txt
This gives you hashes of everything.
cut -b -32 hashes.txt | sort | uniq -d > dupe_hashes.txt
That will use cut to get the hash for each file, sort the hashes, then find any duplicated hashes. Those are written to dupe_hashes.txt without the filenames attached. Now we need to map hashes back to the files.
(for hash in $(cat dupe_hashes.txt); do
    grep "^$hash" hashes.txt | tail -n +2 | cut -b 35-
done) > dupe_files.txt
This does not appear to run slowly for me. The Linux kernel does a very good job of keeping files like this in memory instead of reading them from the disk frequently. If you prefer to force this to be in memory, you could just use /dev/shm/hashes.txt instead of hashes.txt. I've found that it was unnecessary in my tests.
That gives you every file that's a duplicate. So far, so good. You'll probably want to review this list. If you want to list the original one as well, remove the tail -n +2 | bit from the command.
When you are comfortable that you can delete every listed file, you can pipe things to xargs. This will delete the files in groups of 50.
xargs -L 50 rm < dupe_files.txt
I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.
First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1), regardless of the number of items in the hash table.
To explain, we can look at the work that is done by test -f, or [ -f X ], from coreutils (gitweb):
case 'e':
    unary_advance ();
    return stat (argv[pos - 1], &stat_buf) == 0;
...
case 'f':                       /* File is a file? */
    unary_advance ();
    /* Under POSIX, -f is true if the given file exists
       and is a regular file. */
    return (stat (argv[pos - 1], &stat_buf) == 0
            && S_ISREG (stat_buf.st_mode));
So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the underlying filesystem implementation.
For every filesystem, stat will split up the path into directory components and walk it down. For instance, for the path /tmp/hashes/the_md5: it starts with /, gets its inode, then looks up tmp inside it and gets that inode (/tmp is a new mountpoint), then gets the hashes inode, and finally the tested filename and its inode. You can expect the inodes all the way to /tmp/hashes/ to be cached because they are repeated at each iteration, so those lookups are fast and likely don't require disk access. Each lookup will depend on the filesystem the parent directory is on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).
tmpfs in Linux relies on simple_lookup to obtain the inode of a file in a directory. tmpfs lives in the kernel tree under its old name, in mm/shmem.c. tmpfs, much like ramfs, doesn't seem to implement data structures of its own to keep track of virtual data; it simply relies on the VFS directory entry cache (documented under Directory Entry Caches).
Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.
Other filesystems like Ext2/3, however, will incur a penalty linear with the number of files present in the directory.
storing them in memory
As others have suggested, you may also store MD5s in memory by storing them in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left off if you were to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on in subsequent runs), but it is slower.
MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1
...
digest=$(md5hash_of <filename>)
let exists=SEEN_$digest
if [[ "$exists" == 1 ]]; then
# already seen this file
fi
faster tests
And you may use [[ -f my_file ]] instead of [ -f my_file ]. The command [[ is a bash built-in, and is much faster than spawning a new process (/usr/bin/[) for each comparison. It will make an even bigger difference.
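If you want to see how much difference this makes on a given machine, a quick-and-dirty comparison (the path is arbitrary and need not exist) might be:

time for i in $(seq 100000); do [ -f /tmp/some_file ]; done
time for i in $(seq 100000); do [[ -f /tmp/some_file ]]; done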
what is /usr/bin/[
/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):
#define LBRACKET 1
#include "test.c"
so they're interchangeable.
The choice between reading the contents of a file containing hashes and finding a hash in a directory of filenames that are the hashes basically comes down to "is the kernel quicker at reading a directory or your program at reading a file". Both are going to involve a linear search for each hash, so you end up with much the same behaviour. You can probably argue that the kernel should be a little quicker, but the margin won't be great. Note that most often, the linear search will be exhaustive because the hash won't exist (unless you have lots of duplicate files). So, if you're processing a few thousand files, the searches will process a few million entries overall — it is quadratic behaviour.
If you have many hundreds or thousands of files, you'd probably do better with a two-level hierarchy — for example, a directory containing two-character sub-directories 00 .. FF, and then storing the rest of the name (or the full name) in the sub-directory. A minor variation of this technique is used in the terminfo directories, for example. The advantage is that the kernel only has to read relatively small directories to find whether the file is present or not.
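As a sketch of how that two-level layout might look inside the question's loop (the /tmp/hashes base directory comes from the question; splitting on the first two hex characters is the assumption here):

hash=$(md5sum "$i" | cut -d " " -f1)
dir="/tmp/hashes/${hash:0:2}"          # first two characters pick the sub-directory
mkdir -p "$dir"
if [ -f "$dir/${hash:2}" ]; then       # the rest of the hash is the filename
    mv "$i" /tmp/deleted
else
    touch "$dir/${hash:2}"
fi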
I haven't "hashed" this out, but I'd try storing your md5sums in a bash hash.
See How to define hash tables in Bash?
Store the md5sum as the key and, if you want, the filename as the value. For each file, just see if the key already exists in the hash table. If so, you don't care about the value, but you could use it to print out the original duplicate file's name. Then delete the current file (with the duplicate key). Not being a bash expert, that's where I'd start looking.
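A rough sketch of that approach with a bash associative array (this needs bash 4 or later; the /tmp/deleted destination is taken from the question):

declare -A seen                        # maps md5 -> first filename seen with that hash
for i in * ; do
    hash=$(md5sum "$i" | cut -d " " -f1)
    if [[ -n "${seen[$hash]}" ]]; then
        echo "Deleted $i (duplicate of ${seen[$hash]})"
        mv "$i" /tmp/deleted
    else
        seen[$hash]=$i
    fi
done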
I found myself quite stumped. I am trying to output data from a script to a file.
Although I need to keep only the last 10 values, so simply appending won't work.
The main script returns one line, so I save it to a file. I use tail to get the last 10 lines and process them, but then I get to the point where the file is too big, due to the fact that I continue to append lines to it (the script outputs a line every minute or so, which brings up the size of the log quite fast).
I would like to limit the number of writes that I do on that script, so I can always have only the last 10 lines, discarding the rest.
I have thought about different approaches, but they all involve a lot of activity, like creating temp files, deleting the original file, and creating a new file with just the last 10 entries; but it feels so inelegant and very amateurish.
Is there a quick and clean way to query a file, so I can add lines until I hit 10 lines, and then start to delete the lines in chronological order, and add the new ones on the bottom?
Maybe things are easier than what I think, and there is a simple solution that I cannot see.
Thanks!
In general, it is difficult to remove data from the start of a file. The only way to do it is to overwrite the file with the tail that you wish to keep. It isn't that ugly to write, though. One fairly reasonable hack is to do:
{ rm file; tail -9 > file; echo line 10 >> file; } < file
This will retain the last 9 lines and add a 10th line. There is a lot of redundancy, so you might like to do something like:
append() { test -f $1 && { rm $1; tail -9 > $1; } < $1; cat >> $1; }
And then invoke it as:
echo 'the new 10th line' | append file
Please note that this hack of redirecting input from the same file that is later reopened for output is a bit fragile and obscure. It is entirely possible for the script to be interrupted and the file deleted! It would be safer and more maintainable to explicitly use a temporary file.
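For completeness, a sketch of that temporary-file variant (append_safe is just an illustrative name):

append_safe() {
    tmp=$(mktemp "$1.XXXXXX") || return 1            # temp file next to the target
    { test -f "$1" && tail -n 9 "$1"; cat; } > "$tmp" &&
    mv "$tmp" "$1"
}
echo 'the new 10th line' | append_safe file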
I have a very long file with numbers. Something like the output of this perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, where the output file number is the number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know any way (aside from writing my own stream-splitting program, implementing buffering, and some sort of pool-of-filehandles management) to handle such a case - that is: splitting into multiple files, where access to the output files is random, and gzipping all output partitions on the fly?
I didn't write it in the question itself, but since the additional information belongs together with the solution, I'll write it all here.
So - the problem was on Solaris. Apparently there is a limitation that no 32-bit program using stdio on Solaris can have more than 256 open file handles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
I'm on a shared server with restricted disk space and I've got a gz file that expands into a HUGE file, more than the space I've got. How can I extract it portion by portion (let's say 10 MB at a time), and process each portion, without extracting the whole thing even temporarily?
No, this is just ONE super huge compressed file, not a set of files, please...
Hi David, your solution looks quite elegant, but if I'm reading it right, it seems like every time gunzip extracts from the beginning of the file (and the output of that is thrown away). I'm sure that'll be causing a huge strain on the shared server I'm on (I don't think it's "reading ahead" at all) - do you have any insights on how I can make gunzip "skip" the necessary number of blocks?
If you're doing this with (Unix/Linux) shell tools, you can use gunzip -c to uncompress to stdout, then use dd with the skip and count options to copy only one chunk.
For example:
gunzip -c input.gz | dd bs=10485760 skip=0 count=1 >output
then skip=1, skip=2, etc.
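For example, a loop over successive portions might look like the sketch below. Note that every pass decompresses the stream from the beginning again (the cost raised in the comment above), process_chunk is a placeholder for whatever per-portion processing is needed, and iflag=fullblock is a GNU dd option that avoids short reads when the input is a pipe:

bs=10485760                  # 10 MB per portion
n=0
while :; do
    gunzip -c input.gz | dd iflag=fullblock bs=$bs skip=$n count=1 of=chunk.tmp 2>/dev/null
    [ -s chunk.tmp ] || break        # an empty chunk means we ran past the end of the stream
    process_chunk chunk.tmp          # placeholder for the real processing step
    n=$((n + 1))
done
rm -f chunk.tmp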
Unfortunately I don't know of an existing Unix command that does exactly what you need. You could do it easily with a little program in any language, e.g. in Python, cutter.py (any language would do just as well, of course):
import sys
try:
    size = int(sys.argv[1])
    N = int(sys.argv[2])
except (IndexError, ValueError):
    print>>sys.stderr, "Use: %s size N" % sys.argv[0]
    sys.exit(2)
# stdin comes from a pipe here, so it cannot be seek()ed;
# read and discard the first (N-1)*size bytes instead
to_skip = (N - 1) * size
while to_skip > 0:
    data = sys.stdin.read(min(to_skip, 1 << 20))
    if not data:
        break
    to_skip -= len(data)
sys.stdout.write(sys.stdin.read(size))
Now gunzip <huge.gz | python cutter.py 1000000 5 > fifthone will put in file fifthone exactly a million bytes, skipping the first 4 million bytes in the uncompressed stream.