bash: intensive read/write operations cause damaged files

I have big txt files (say, 1,000,000 lines each) and I want to split them by some field, writing the data to different output files in several dirs (one input file - one output dir). I can do it simply with awk:
awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' some_file;
If I process the files one by one it always works well, but if I try to work on many files at the same time, I usually (not always) end up with some output files truncated (I know the exact field count, which is always the same in my case, so it's easy to find bad files). I use a command like
for i in <input_files>; do
awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' < $i &
done
so each process creates files in its own output dir. I also tried to parallelize it with xargs and got the same result - some random files were truncated.
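For reference, a rough sketch of how such bad files can be spotted, assuming every intact line has exactly N fields (N=7 below is only a placeholder):
awk -v N=7 'NF != N { bad[FILENAME] = 1 } END { for (f in bad) print f }' dir_*/*.some_suffix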
How could this happen? Is it a RAM or filesystem cache problem? Any suggestions?
Hardware: non-ECC RAM, AMD Opteron 6378 processors. I used an SSD (Plextor M5S) and tmpfs, with ext4 and reiserfs (the output files are small).

You are probably running out of file descriptors in your awk process; if you check carefully you'll find that maybe the first 1021 (just under a power of 2; check ulimit -n for the limit) unique filenames work. Using print ... >> does not have the same behaviour as in a shell: it leaves the file open.
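Two quick checks (a sketch, reusing the filename from the question): the per-process descriptor limit, and how many distinct output files a single awk job would have to keep open:
ulimit -n
awk '{print $1}' some_file | sort -u | wc -l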
I assume you are using something more contemporary than a vintage awk, e.g. GNU's gawk:
https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html
Similarly, when a file or pipe is opened for output, awk remembers the file name or command associated with it, and subsequent writes to the same file or command are appended to the previous writes. The file or pipe stays open until awk exits.
This implies that special steps are necessary in order to read the same file again from the beginning, or to rerun a shell command (rather than reading more output from the same command). The close() function makes these things possible:
close(filename)
Try it with close():
gawk '{
outfile="dir_"'$i'"/"$1".some_suffix"
print $0 >> outfile
close(outfile)
}' some_file;
gawk offers the special ERRNO variable which can be used to catch certain errors; sadly, it's not set on output redirection errors, so this condition cannot be easily detected. However, gawk detects this condition internally (error EMFILE during an open operation) and attempts to close a not-recently-used file descriptor so that it can continue, but this isn't guaranteed to work in every situation.
With gawk, you can use --lint for various run-time checks, including hitting the file-descriptor limit and failure to explicitly close files:
$ seq 1 1050 | gawk --lint '{outfile="output/" $1 ".out"; print $0 >> outfile;}'
gawk: cmd. line:1: (FILENAME=- FNR=1022) warning: reached system limit for open files:
starting to multiplex file descriptors
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1050.out' provided
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1049.out' provided
[...]
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1.out' provided
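If calling close() on every line turns out to be slow, another approach (a sketch, not from the original answer; it reuses dir_$i and .some_suffix from the question) is to sort each input on the first field, so every output file is written in one contiguous run and closed as soon as the key changes:
sort -k1,1 "$i" | gawk -v dir="dir_$i" '{
    outfile = dir "/" $1 ".some_suffix"
    if (outfile != prev) {
        if (prev != "") close(prev)   # previous key is finished, release its descriptor
        prev = outfile
    }
    print $0 >> outfile
}'
This keeps at most one output file open per process, regardless of how many distinct keys there are.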

Related

Making nested awk fault tolerant

I have this command (part of a for loop) that tries to get the start time of a process in seconds and stores it in a variable for later use. The command parses /proc/<pid>/stat. In this context a process can be ephemeral, so by the time I actually run the command the directory may no longer exist. This causes the nested awk to fail, which causes a syntax error in the outer division. How can I prevent this?
starttime=$(($(awk '{print $22}' $d/stat) / systick));
$d is /proc/<pid>
I would do it like this:
starttime=$(($(awk '{print $22}' $d/stat 2>/dev/null || echo 0) / systick));
Basically, if awk fails for some reason, such as the PID's stat file not being there anymore, 0 is returned to the division instead.
This way only a small change to your code is required.
PS: I am assuming that systick will always be something other than 0.
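A slightly different sketch (an addition, not part of the original answer) skips vanished processes altogether instead of feeding 0 into the arithmetic; it assumes the snippet sits inside the for loop mentioned in the question:
if [ -r "$d/stat" ]; then
    starttime=$(( $(awk '{print $22}' "$d/stat" 2>/dev/null || echo 0) / systick ))
else
    continue   # process already gone, move on to the next one
fi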

How to run a shell command and use the result in a Fortran program?

Is it possible to call a shell command from a Fortran script?
My problem is that I analyze really big files. These files have a lot of lines, e.g. 84084002 or similar.
I need to know how many lines the file has before I start the analysis, so I usually used the shell command wc -l "filename" and then used this number as the value of a variable in my script.
But I would like to call this command from my program, get the number of lines, and store it in a variable.
There is now (in the 2008 standard, but already implemented by most of the commonly-encountered Fortran compilers, including gfortran) a standard intrinsic subroutine execute_command_line which does, approximately, what the widely-implemented but non-standard subroutine system does. As @MarkSetchell has (almost) written, you could try
CALL execute_command_line('wc -l < file.txt > wc.txt' )
OPEN(unit=nn,file='wc.txt')
READ(nn,*) count
What Fortran doesn't have is a standard way in which to get the number of lines in a file without recourse to the kind of operating-system-dependent workaround above. Other, that is, than opening the file, counting the number of lines, and then rewinding to the start of the file to commence reading.
You should be able to do something like this:
command='wc -l < file.txt > wc.txt'
CALL system(command)
OPEN(unit=nn,file='wc.txt')
READ(nn,*) count
You can output the number of lines to a file (fort.1)
wc -l file|awk '{print $1}' > fort.1
In your Fortran program, you can then store the number of lines to a variable (e.g. count) by reading the file fort.1:
read (1,*) count
then you can loop over the variable count and read your whole file:
do i = 1, count
   read (file,*)   ! here, file is the unit number your data file is opened on
end do

clear piped output without interrupting program?

I am running a program (pianobar) with its output piped to a text file; it writes output every second. The resulting file ("pianobarout.txt") needs to be cleared regularly, or it grows to massive proportions. However, I do not want to stop pianobar in order to clear the file.
I have tried running > pianobarout.txt as well as echo "" > pianobarout.txt, but both cause the system's resources to spike heavily for almost 30 seconds, causing the audio from pianobar to skip. I tried removing the file, but it appears that the file is not recreated after being deleted, and I just lose the pipe.
I'm working from python, so if any library there can help, those are available to me.
Any ideas?
If you are currently redirecting with truncation, like yourprogram > file.txt, try redirecting with appending: yourprogram >> file.txt.
There is a big difference between the two when the output file is truncated.
With appending redirection, data is written to the current end of the file. If you truncate it to 0 bytes, the next write will happen at position 0.
With truncating redirection, the writer keeps its own file offset, so data is written wherever its last write left off. If you truncate the file to 0 bytes after the writer has already produced 1 GB, its next write still lands at byte 1073741824, where it last left off.
This results in a sparse file if the filesystem supports it (ext2-4 and most Unix fs do), or a long wait while the file is written out if it doesn't (like fat32). A long wait could also be caused by anything following the file, such as tail -f, which has to potentially catch up by reading a GB of zeroes.
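Putting the suggestion together as a sketch (append-mode logging plus an in-place truncate; names as in the question):
pianobar >> pianobarout.txt     # start with append redirection instead of >
: > pianobarout.txt             # later, from another shell (or from python): truncate in place; the next append lands at offset 0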
Alternatives include yourprogram | split -b 1G - output-, which will write 1 GB each to output-aa, output-ab, etc., letting you delete old files at your leisure.

Smart split file with gzipping each part?

I have a very long file with numbers. Something like the output of this perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, where the output file number is the number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could do:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know of any way (aside from writing my own stream-splitting program, implementing buffering, and some sort of pool-of-file-handles management) to handle such a case - that is, splitting into multiple files, where access to the output files is random, and gzipping all output partitions on the fly?
I didn't put it in the question itself, but since the additional information goes together with the solution, I'll write it all here.
So - the problem occurred on Solaris. Apparently there is a limitation that no 32-bit program using stdio on Solaris can have more than 256 open file handles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
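For systems without that workaround, one portable sketch (an addition, not part of the original solution) is to group the stream by partition first, so that only one gzip pipe is ever open at a time:
perl -le 'print int(rand() * 1000000) for 1..1000' |
awk '{print $1 % 100 "\t" $0}' | sort -k1,1n |
awk -F'\t' '{
    if ($1 != prev) {                       # new partition: finish the previous gzip pipe
        if (cmd != "") close(cmd)
        cmd = "gzip -c > " $1 ".txt.gz"
        prev = $1
    }
    print $2 | cmd
}'
The trade-off is that sort spills to temporary files for inputs of hundreds of gigabytes, so this exchanges open-file pressure for extra disk I/O.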

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where the files get updated, and the other is the Original folder, where the original files exist). Every time a file is updated in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: though the file names in the Delta folder and the Original folder are the same, the content of the files may be different. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now I need to merge Delta_Folder/1.properties with Original_Folder/1.properties, so my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt);
find all *.properties files in the Original folder and save the list to a temp file (original-files.txt);
then get the list of files that appear in both folders and put those in a loop;
then loop over each file, reading each line from the property file (1.properties);
then read each line (delta-line="account.org.com.email=New-Email") from the Delta folder's property file and split the line on the delimiter "=" into two string variables
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email;);
then read each line (orig-line="account.org.com.email=Old-Email") from the Original folder's property file and split the line on the delimiter "=" into two string variables
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;);
if delta-line-string1 == orig-line-string1 then update $orig-line with $delta-line
i.e.:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in original folder/1.properties with
account.org.com.email=New-Email
Once the loop has gone through all the lines in a file, it moves on to the next file. The loop continues until it has processed all the files in the folder.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it's working fine, but it takes a long time (about 4 minutes per file), because it goes through three loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I'm wondering if there is any way I can reduce the loops so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
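Applied to every matching pair of files, a sketch (an addition; the folder names are the ones from the question, and like the one-liner it assumes values themselves contain no further "="):
for f in Delta_Folder/*.properties; do
    name=$(basename "$f")
    [ -f "Original_Folder/$name" ] || continue
    awk -F'=' '{arr[$1]=$2} END{for (x in arr) print x "=" arr[x]}' \
        "Original_Folder/$name" "$f" > "Original_Folder/$name.tmp" &&
        mv "Original_Folder/$name.tmp" "Original_Folder/$name"
done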
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape &, /, and \ as mentioned in this answer.
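A rough sketch of that idea (an addition, not from the original answer): generate one sed script per delta file with awk, then apply it to the original in a single pass. GNU sed's -i is assumed, and keys and values are assumed free of the characters that need escaping:
awk -F'=' '{printf "s|^%s=.*$|%s|\n", $1, $0}' Delta_Folder/1.properties > merge.sed
sed -i -f merge.sed Original_Folder/1.properties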
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
