I have a very long file with numbers, something like the output of this Perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, where the output file number is the number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could do:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But, when ulimit is configured to allow less open files than number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know any way (aside from writing my own stream-splitting program, with buffering and some sort of pool-of-filehandles management) to handle such a case - that is, splitting into multiple files, where access to the output files is effectively random, and gzipping all output partitions on the fly?
I didn't put this in the question itself, but since the additional information belongs together with the solution, I'll write it all here.
So: the problem was on Solaris. Apparently there is a limitation that no program using stdio on Solaris can have more than 256 open filehandles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
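If the descriptor limit cannot be raised at all, the hand-rolled route the question hoped to avoid is still workable, because concatenated gzip members form a valid gzip file, so partitions can be closed and reopened in append mode at will (gawk can get the same effect by calling close() on the gzip pipe after each record, at the cost of one gzip process per record). A minimal Perl sketch, where the 50-pipe cap and the file names are illustrative assumptions:
#!/usr/bin/perl
use strict;
use warnings;

my $max_open = 50;                       # stay well under `ulimit -n`; arbitrary cap
my %gz;

while (my $line = <STDIN>) {
    my ($n) = $line =~ /^(\d+)/ or next;
    my $part = $n % 100;
    unless ($gz{$part}) {
        if (keys %gz >= $max_open) {     # crude pool: flush everything
            close $_ for values %gz;
            %gz = ();
        }
        # ">>" is safe: concatenated gzip members are still a valid gzip file
        open $gz{$part}, '|-', "gzip -c >> $part.txt.gz"
            or die "cannot start gzip for partition $part: $!";
    }
    print { $gz{$part} } $line;
}
close $_ for values %gz;
Closing a pipe waits for that gzip process to exit, so the flush-everything step is crude but correct; a least-recently-used pool would be the obvious refinement.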
Is it possible to call shell command from a Fortran script?
My problem is that I analyze really big files. These files have a lot of lines, e.g. 84084002 or similar.
I need to know how many lines the file has before I start the analysis, so I usually used the shell command wc -l "filename" and then used that number as the value of a variable in my script.
But I would like to call this command from within my program and store the number of lines in a variable.
The Fortran 2008 standard (already implemented by most of the commonly-encountered Fortran compilers, including gfortran) provides a standard intrinsic subroutine execute_command_line, which does approximately what the widely-implemented but non-standard subroutine system does. As @MarkSetchell has (almost) written, you could try
CALL execute_command_line('wc -l < file.txt > wc.txt' )
OPEN(unit=nn,file='wc.txt')
READ(nn,*) count
What Fortran doesn't have is a standard way in which to get the number of lines in a file without recourse to the kind of operating-system-dependent workaround above. Other, that is, than opening the file, counting the number of lines, and then rewinding to the start of the file to commence reading.
You should be able to do something like this:
command='wc -l < file.txt > wc.txt'
CALL system(command)
OPEN(unit=nn,file='wc.txt')
READ(nn,*) count
You can output the number of lines to a file (fort.1)
wc -l file|awk '{print $1}' > fort.1
In your Fortran program, you can then store the number of lines to a variable (e.g. count) by reading the file fort.1:
read (1,*) count
then you can loop over the variable count and read your whole file
do i = 1, count
   read (file, *)
end do
(here file is the unit number of the already-opened data file).
I have big txt files (say, 1,000,000 lines each) and I want to split them by some field, writing the data to different output files in several dirs (one input file, one output dir). I can do it simply with awk:
awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' some_file;
If I process the files one by one it always works well, but if I try to work on many files at the same time, I usually (though not always) end up with some output files truncated (I know exactly how many fields there should be, and it's always the same in my case, so it's easy to find the bad files). I use a command like
for i in <input_files>; do
awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' < $i &
done
so each process creates files in its own output dir. I also tried to parallelize it with xargs and got the same result: some random files were truncated.
How could this happen? Is it a RAM or filesystem-cache problem? Any suggestions?
Hardware: non-ECC RAM, AMD Opteron 6378 processors. I used an SSD (Plextor M5S) and tmpfs, with ext4 and reiserfs (the output files are small).
You are probably running out of file descriptors in your awk process; if you check carefully, you'll find that maybe only the first 1021 or so unique filenames work (just under a power of two; check ulimit -n for the limit). Using print ... >> does not have the same behaviour as in a shell: it leaves the file open.
I assume you are using something more contemporary than a vintage awk, e.g. GNU's gawk:
https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html
Similarly, when a file or pipe is opened for output, awk remembers the file name or command associated with it, and subsequent writes to the same file or command are appended to the previous writes. The file or pipe stays open until awk exits.
This implies that special steps are necessary in order to read the same file again from the beginning, or to rerun a shell command (rather than reading more output from the same command). The close() function makes these things possible:
close(filename)
Try it with close():
gawk '{
outfile="dir_"'$i'"/"$1".some_suffix"
print $0 >> outfile
close(outfile)
}' some_file;
gawk offers the special ERRNO variable, which can be used to catch certain errors; sadly it is not set on output-redirection errors, so this condition cannot easily be detected that way. However, gawk does detect the condition internally (error EMFILE during an open operation) and attempts to close a not-recently-used file descriptor so that it can continue, but this isn't guaranteed to work in every situation.
With gawk, you can use --lint for various run-time checks, including hitting the file-descriptor limit and failure to explicitly close files:
$ seq 1 1050 | gawk --lint '{outfile="output/" $1 ".out"; print $0 >> outfile;}'
gawk: cmd. line:1: (FILENAME=- FNR=1022) warning: reached system limit for open files:
starting to multiplex file descriptors
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1050.out' provided
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1049.out' provided
[...]
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1.out' provided
I need a text-processing tool that can perform search and replace operations PER LINE on HUGE TEXT FILES (>0.5 GB). It can be either Windows- or Linux-based. (I don't know if there is anything like a stream reader/writer in Linux, but I have a feeling that it would be the ideal solution. The editors I have tried so far load the whole file into memory.)
Bonus question: a tool that can MERGE two huge texts on a per-line basis, separated by e.g. tabs.
Sounds like you want sed. For example,
sed 's/foo/bar/' < big-input-file > big-output-file
should replace the first occurrence of foo by bar in each line of big-input-file, writing the results to big-output-file.
Bonus answer: I just learned about paste, which seems to be exactly what you want for your bonus question.
'sed' is built into Linux/Unix, and is available for Windows. I believe that it only loads a buffer at a time (not the whole file) -- you might try that.
What would you be trying to do with the merge -- interleaved in some way, rather than just concatenating?
Add: interleave.pl
use strict;
use warnings;
my $B;
open INA, $ARGV[0];
open INB, $ARGV[1];
while (<INA>) {
print $_;
$B = <INB>;
print $B;
}
close INA;
close INB;
run: perl interleave.pl fileA fileB > mergedFile
Note that this is a very bare-bones utility. It does not check if the files exist, and it expects that the files have the same number of lines.
I would use Perl for this. It is easy to read a file line by line, it has great search/replace via regular expressions, and it will let you do the merge as well, since you can make your Perl script aware of both files.
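To make that concrete, here is a minimal sketch of the streaming approach, assuming placeholder file names and the foo/bar pattern from the sed answer above:
use strict;
use warnings;

# Stream a huge file through a per-line substitution; only one line is
# ever held in memory, so the file size doesn't matter.
open my $in,  '<', 'big-input-file'  or die "open input: $!";
open my $out, '>', 'big-output-file' or die "open output: $!";
while (my $line = <$in>) {
    $line =~ s/foo/bar/;          # per-line search and replace
    print {$out} $line;
}
close $in;
close $out;
For the merge, the same loop can pull a line from a second filehandle on every iteration, as interleave.pl above does.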
I'm on a shared server with restricted disk space and I've got a gz file that expands into a HUGE file, more than I have room for. How can I extract it "portion" by "portion" (let's say 10 MB at a time), and process each portion, without extracting the whole thing even temporarily?
No, this is just ONE super huge compressed file, not a set of files, please...
Hi David, your solution looks quite elegant, but if I'm reading it right, it seems like every time gunzip extracts from the beginning of the file (and the output of that is thrown away). I'm sure that will put a huge strain on the shared server I'm on (I don't think it's "reading ahead" at all). Do you have any insight into how I can make gunzip "skip" the necessary number of blocks?
If you're doing this with (Unix/Linux) shell tools, you can use gunzip -c to uncompress to stdout, then use dd with the skip and count options to copy only one chunk.
For example:
gunzip -c input.gz | dd bs=10485760 skip=0 count=1 >output
then skip=1, skip=2, etc.
Unfortunately I don't know of an existing Unix command that does exactly what you need. You could do it easily with a little program in any language, e.g. in Python, cutter.py (any language would do just as well, of course):
import sys
try:
    size = int(sys.argv[1])
    N = int(sys.argv[2])
except (IndexError, ValueError):
    sys.stderr.write("Use: %s size N\n" % sys.argv[0])
    sys.exit(2)
# stdin is a pipe, so it cannot be seek()ed: read and discard the first
# N-1 chunks, then copy out the one chunk we actually want
to_skip = (N - 1) * size
while to_skip > 0:
    chunk = sys.stdin.buffer.read(min(to_skip, 1 << 20))
    if not chunk:
        sys.exit(0)   # input shorter than the requested offset
    to_skip -= len(chunk)
sys.stdout.buffer.write(sys.stdin.buffer.read(size))
Now gunzip <huge.gz | python3 cutter.py 1000000 5 > fifthone will put exactly a million bytes into the file fifthone, skipping the first 4 million bytes of the uncompressed stream.
I have recently run into a situation where I need to trim some rather large log files once they grow beyond a certain size. Everything but the last 1000 lines in each file is disposed of, and the job is run every half hour by cron. My solution was to simply run through the list of files, check the size, and trim if necessary.
for $file (@fileList) {
    if ( ((-s $file) / (1024 * 1024)) > $CSize ) {
        open FH, "$file" or die "Cannot open ${file}: $!\n";
        $lineNo = 0;
        my @tLines;
        while (<FH>) {
            push @tLines, $_;
            shift @tLines if ++$lineNo > $CLLimit;
        }
        close FH;
        open FH, ">$file" or die "Cannot write to ${file}: $!\n";
        print FH @tLines;
        close FH;
    }
}
This works in the current form but there is a lot of overhead for large log files (especially the ones with 100_000+ lines) because of the need to read in each line and shift if necessary.
Is there any way I could read in just a portion of the file, e.g. in this instance I want to be able to access only the last "CLLimit" lines. Since the script is being deployed on a system that has seen better days (think Celeron 700MHz with 64MB RAM) I am looking for a quicker alternative using Perl.
I realize you're wanting to use Perl, but if this is a UNIX system, why not use the "tail" utility to do the trimming? You could do this in BASH with a very simple script:
if [ `stat -f "%z" "$file"` -gt "$MAX_FILE_SIZE" ]; then
tail -1000 $file > $file.tmp
#copy and then rm to avoid inode problems
cp $file.tmp $file
rm $file.tmp
fi
That being said, you would probably find this post very helpful if you're set on using Perl for this.
Estimate the average length of a line in the log - call it N bytes.
Seek backwards from the end of the file by 1000 * 1.10 * N (10% margin for error in the factor 1.10). Read forward from there, keeping just the most recent 1000 lines.
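A hedged sketch of that idea in Perl, where the file name, the 1000-line limit, and the 80-byte average line length are illustrative assumptions:
use strict;
use warnings;
use Fcntl ':seek';

my ($file, $keep, $avg) = ('app.log', 1000, 80);    # assumed values
open my $fh, '<', $file or die "Cannot open $file: $!";
my $offset = (-s $fh) - int($keep * $avg * 1.10);   # 10% margin for error
if ($offset > 0) {
    seek $fh, $offset, SEEK_SET;
    <$fh>;                         # discard the probably-partial first line
}
my @tail;
while (<$fh>) {
    push @tail, $_;
    shift @tail if @tail > $keep;  # keep only the most recent $keep lines
}
close $fh;
# @tail now holds (at most) the last $keep lines, ready to be written back.
If the margin turns out to be too small (fewer than 1000 lines read), one could double the estimate and seek again.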
The question was asked: which function or module? The built-in function seek looks to me like the tool to use.
Consider simply using the logrotate utility; it is included in most modern Linux distributions. A related tool for BSD systems is called newsyslog. These tools are designed more or less for your intended purpose: they atomically move a log file out of the way, create a new file (with the same name as before) to hold new log entries, instruct the program generating messages to use the new file, and then (optionally) compress the old file. You can configure how many rotated logs to keep. Here's a potential tutorial:
http://www.debian-administration.org/articles/117
It is not precisely the interface you desire (keeping a certain number of lines), but the program will likely be more robust than anything you cook up on your own; for example, the answers here do not deal with atomically moving the file and notifying the logging program to use a new file, so there is a risk that some log messages are lost.
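For illustration only, a hypothetical logrotate stanza for this kind of job might look roughly like the following; the path, the 10M size threshold, and the program receiving the HUP signal are assumptions, not taken from the question:
# /etc/logrotate.d/myapp (hypothetical)
/var/log/myapp.log {
    size 10M
    rotate 5
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/killall -HUP myapp
    endscript
}
Note that logrotate keeps whole rotated files (optionally compressed) rather than the last N lines of a single file, which is the interface difference mentioned above.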