awk Splitting huge file creates error "too many open files" [duplicate] - bash

I am splitting a very large CSV file into parts. Whenever I run the following command, it does not split the file completely but returns the following error. How can I avoid this and split the whole file?
awk -F, '{print > $2}' test1.csv
awk: YY1 makes too many open files
input record number 31608, file test1.csv
source line number 1

Just close the files after writing, and append with >> so that reopening a file does not truncate what was already written to it:
awk -F, '{print >> $2; close($2)}' test1.csv
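A minimal sketch of the close-after-write approach on made-up data. The function name split_by_col2 and the sample rows are assumptions for illustration. It appends with >> because a file that is closed and then reopened with > would start empty again, leaving only the last record in each output file.

```shell
# Hypothetical helper: split a comma-separated file by its second column,
# closing each output file after every write so only one is open at a time.
split_by_col2() {
    awk -F, '{
        out = $2        # the output file is named after column 2
        print >> out    # append, so earlier records in the file survive
        close(out)      # release the file descriptor immediately
    }' "$1"
}

# Demo on made-up data in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1,US,a' '2,UK,b' '3,US,c' > test1.csv
split_by_col2 test1.csv
cat US    # shows the US records in input order
```

Because the files are opened in append mode, output files left over from an earlier run should be removed first, otherwise the new records are added after the old ones.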

You must have a lot of lines. Are you sure that the value in the second column repeats often enough to justify putting those records into individual files? Anyway, awk holds the files open until the end. You'll need a process that closes the file handles when they are not in use.
Perl to the rescue. Again.
#!perl
while( <> ) {
@content = split /,/, $_;
open ( OUT, ">> $content[1]") or die "whoops: $!";
print OUT $_;
close OUT;
}
usage: script.pl your_monster_file.csv
outputs the entire line into a file named the same as the value of the second CSV column in the current directory, assuming no quoted fields etc.

Related

Fastest way -- Appending a line to a file only if it does not already exist

given this question Appending a line to a file only if it does not already exist
is there a faster way than the solution provided by @drAlberT?
grep -q -F 'string' foo.bar || echo 'string' >> foo.bar
I have implemented the above solution and have to run it against a 500k-line file (i.e. check whether a line is already in a set of 500k lines). Moreover, I have to run this process many times, maybe 10-50 million times. Needless to say, it is slow: each run takes 25-30 ms on my server (so 3-10+ days of runtime in total).
EDIT: the flow is the following: I have a file with 500k lines, each time I run, I get maybe 10-30 new lines and I check if they are already there or not. If not I add them, then I repeat many times. The order of my 500k lines files is important as I'm going through it with another process.
EDIT2: the 500k lines file is always containing unique lines, and I only care about "full lines", no substrings.
Thanks a lot!
Few suggested improvements:
Try using awk instead of grep so that you can both detect the string and write it in one action;
If you do use grep don't use a Bash loop to feed each potential match to grep and then append that one word to the file. Instead, read all the potential lines into grep as matches (using -f file_name) and print the matches. Then invert the matches and append the inverted match. See last pipeline here;
Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash I suppose) so that the core script is called once or a few times with all the lines instead;
Perhaps use multicores since the files are not dependent on each other. Maybe with GNU Parallel (or you could use Python or Ruby or Perl that has support for threads).
Consider this awk for a single line to add:
$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit}
END{if (!f) print line >> FILENAME}' file
Or for multiple lines:
$ awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines file
Some timings using a copy of the Unix words file (235,886 lines) with a five line lines file that has two overlaps:
$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines words
real 0m0.056s
user 0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob
Edit 2
Try this as being the best of both:
$ time grep -x -f lines words |
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real 0m0.012s
user 0m0.010s
sys 0m0.003s
Explanation:
grep -x -f lines words find the lines that ARE in words
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines invert those into lines that are NOT in words
>> words append those to the file
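One caveat with the FNR==NR idiom in this pipeline: if grep finds no matches, awk's first input is empty, FNR==NR stays true while the second file is read, and nothing gets appended. The sketch below works around that with a command-line assignment between the two inputs. The function name append_missing and the variable name second are assumptions, and -F (fixed strings) is an addition not present in the original command.

```shell
# Hypothetical helper: append to "$2" every line of "$1" that is not already
# present in "$2" as a whole line.
append_missing() {
    new=$1 target=$2
    # grep exits with 1 when nothing matches; treat that as success.
    { grep -x -F -f "$new" "$target" || [ "$?" -eq 1 ]; } |
        awk '!second { seen[$0]; next }   # lines that ARE already in target
             !($0 in seen)                # candidate lines that are NOT
        ' - second=1 "$new" >> "$target"
}
```

awk applies the second=1 assignment only after it has finished reading standard input, so the two inputs stay distinguishable even when the first one is empty. Usage would be along the lines of append_missing lines words.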
Turning the millions of passes over the file into a script with millions of actions will save you a lot of overhead. Searching for a single label at each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.
Something along the following lines, perhaps.
awk 'NR==FNR { a[$0]++; next }
$0 in a { delete a[$0] }
1
END { for (k in a) print k }' strings bigfile >bigfile.new
If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
On the other hand, if you have already (effectively) divided the input set into sets of 10-30 labels, you can obviously only search for those 10-30 in one pass. Still, this should provide you with a speed improvement on the order of 10-30 times.
This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.
If duplicates are not valid in the file, just append them all and filter out the duplicates:
cat myfile mynewlines | awk '!n[$0]++' > mynewfile
This will allow appending millions of lines in seconds.
If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
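A small sketch of the append-then-filter idea that writes the result back to the original file. The function name append_unique and the use of a temporary file are assumptions. The awk filter keeps the first occurrence of each line, so the existing order of the file is preserved and new lines land at the end.

```shell
# Hypothetical helper: add the lines of "$2" to "$1", dropping any line that
# has already been seen earlier in either file.
append_unique() {
    file=$1 additions=$2
    tmp=$(mktemp) || return 1
    # '!seen[$0]++' is true only the first time a given line is read.
    awk '!seen[$0]++' "$file" "$additions" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Called as append_unique myfile mynewlines, this makes one pass over both files per call, so it pays off most when many new lines are batched into each call.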
Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).
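Prepending the prefix 'x' to each key guards against awkward inputs such as '-e' or an empty line; better ways probably exist.
#!/bin/bash
declare -A aa
# Load every existing line as a key; the "x" prefix avoids empty subscripts.
while IFS= read -r line; do aa["x$line"]=1
done < file.txt
# Read candidate lines from stdin and append only the ones not seen yet.
while IFS= read -r line; do
  if [[ -z ${aa["x$line"]} ]]; then
    aa["x$line"]=1
    printf '%s\n' "$line" >> file.txt
  fi
done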

Converting from tsv to fasta

I have a bunch of TSV files in my folder, and for every one of them I would like to get a fasta file where the header after the '>' sign is the name of the file.
My TSV file has 5 columns without header:
Thus:
inputfile called: "A.coseq.table_headless.tsv"
HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
output file called "A.fasta"
>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
I want to run the script in bash for all the files at once, and I have this script, which does not work because the awk print statement contains a curly brace:
for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta
done
Any idea how to correct this code?
Thank you very much
if the basename is the part until the first ".", you can get rid of the loop as well.
awk '{split(FILENAME,base,".");
print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
The other solutions posted so far have a few issues:
not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,
calculating the output file name every time a line is read rather than once when the input file is opened is inefficient, and
using an unparenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).
This will work robustly and efficiently in all awks:
awk '
FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
{ print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv
Another awk solution:
awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1);
printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv
pfx contains the first part of filename (till the 1st .)
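For completeness, the loop from the question can also be kept and repaired: a shell variable is not expanded inside a single-quoted awk program, but it can be handed over with -v. This is only a sketch, and the function name tsv_to_fasta is an assumption. It takes the part of the file name before the first "." as the prefix, as the answers above do.

```shell
# Hypothetical helper: convert every matching TSV in the current directory.
tsv_to_fasta() {
    for sample in *coseq.table_headless.tsv; do
        [ -e "$sample" ] || continue      # no matching files: skip the literal pattern
        base=${sample%%.*}                # A.coseq.table_headless.tsv -> A
        # -v passes the shell value into awk as the awk variable "base".
        awk -v base="$base" '{ print ">" base "_" $3; print $5 }' "$sample" > "$base.fasta"
    done
}

tsv_to_fasta
```

This starts one awk process per input file, which is slower than the single-awk answers but keeps each output file handled by the shell redirection, so there is nothing to close inside awk.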

performance issues in shell script

I have a 200 MB tab separated text file with millions of rows. In this file, I have a column with multiple locations like US , UK , AU etc.
Now I want to split this file on the basis of this column. The code below works, but it has a performance problem: it takes more than 1 hour to split the file into multiple files based on location. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we are reading the entire file only once, and depending on the locations, we are creating and appending the data to it.
Please suggest necessary improvements in the code to enhance its performance.
Following is a sample data and is separated by colon instead of tab. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you use something like this:
# save first line
NR == 1 {
header = $0
next
}
{
filename = "file_" $col ".txt"
# if country code has changed
if (filename != prev) {
# close the previous file
close(prev)
# if we haven't seen this file yet
if (!(filename in seen)) {
print header > filename
}
seen[filename]
}
# print whole line to file
print >> filename
prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(prev).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk permit different numbers of simultaneously open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
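A bounded way to check whether a given awk can hold a particular number of output files open at once is sketched below. The function name probe_open_files is an assumption, and so is the expectation that awk exits with a non-zero status when it cannot open another file; that appears to hold for common implementations but is not guaranteed.

```shell
# Hypothetical helper: succeed if awk can keep "$1" output files open at once.
probe_open_files() {
    n=$1
    dir=$(mktemp -d) || return 1
    awk -v n="$n" -v dir="$dir" 'BEGIN {
        for (i = 1; i <= n; i++) {
            out = dir "/file_" i ".txt"
            print "hello" > out       # every file stays open; nothing is closed
        }
    }' 2>/dev/null
    status=$?
    rm -rf "$dir"
    return "$status"
}

probe_open_files 10 && echo "10 files: ok"
```

Calling it with increasing counts narrows down the limit without needing an endless input stream.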
If you're not confident in your potential number of open files, you can experiment with your version of awk using something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
# first time we see this sport: record its filename and write the header
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\" ", sport)
}
printf(")\n")
}
The idea here is that we store sport names in the array files[], and build filenames out of that array. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script's output inside command substitution, and the declare command will effectively be interpreted by your shell:
eval "$(/path/to/awkscript inputfile.csv)"
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the file names might contain spaces, you need a different solution; assuming you're on bash 4, just be sure to print each file name on a separate line and use readarray -t (the -t strips the trailing newline from each entry):
readarray -t filesArray < <( awk ... )
If the file names might contain newlines too, then things get tricky...
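The one-name-per-line approach can be sketched end to end as follows. The function name split_by_sport, the list file names.txt, and the sample rows are assumptions; the input is taken to be whitespace-separated as in the question's example. The names go to a file first so that awk has finished writing every output file before anything starts processing them.

```shell
# Hypothetical helper: split "$1" by column 7, write the header into each new
# file, and print every created file name once on standard output.
split_by_sport() {
    awk 'NR == 1 { header = $0; next }
         { out = $7 ".csv" }
         !(out in seen) {              # first time this sport appears
             seen[out]
             print header > out        # start the file with the header
             close(out)
             print out                 # report the new file name
         }
         { print >> out; close(out) }  # append the record, then release the file
    ' "$1"
}

# Demo with made-up rows in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Date Location Team1 Team2 Time Prize_$ Sport' \
    '2016 NY Raptors Gators 12pm $500 Soccer' \
    '2016 LA Owls Hawks 1pm $100 Golf' \
    '2017 NY Lions Bears 3pm $200 Soccer' > input_file.csv

split_by_sport input_file.csv > names.txt
while IFS= read -r name; do
    echo "processing $name"            # placeholder for the real per-file step
done < names.txt
```

In bash 4 the same list file can be loaded into an array with readarray -t; the read loop is used here only because it also works in plain sh.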
if your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values, you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
this then can be piped to your other process, here for example to wc
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.

How to process many csv files into many separate dat files

I have several thousand csv files I wish to reformat. They all have a standard filename with incremental integer, eg. file_1.csv, file_2.csv, file_3.csv, and they all have the same format:
CH1
s,Volts
-1e-06,-0.0028,
-9.998e-07,-0.0032,
-9.99e-07,-0.0036,
For 10,002 lines. I want to remove the header, and I want to separate the two columns into separate files. I have the following code which produces the results I want when I consider a single input file:
tail -10000 file_1.csv |
awk -F, '{print $1 > "s.dat"; print $2 > "Volts.dat"}'
However, I want something that will produce the equivalent files for each csv file, say, replace s.dat with s_$i.dat or similar, but I'm not sure how to go about this, and how to call in each separate csv file in a loop rather than explicitly stating it as file_1.csv.
awk to the rescue!
awk -F, 'FNR>2{print $1 > ("s_" FILENAME ".dat");
print $2 > ("Volts_" FILENAME ".dat")}' file*
or reading the filename from the data files
$ awk -F, 'FNR==2{s="_"FILENAME".dat";h1=$1s;h2=$2s}
FNR>2{print $1 > h1; print $2 > h2}' file*
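If the output names should carry only the integer from the input name (s_1.dat rather than s_file_1.csv.dat), the name can be derived once per input file. This is a sketch under the assumption that inputs are named like file_N.csv; the function name split_columns and the variables id, s_out and v_out are made up for illustration. It also closes the previous pair of outputs when a new input starts, which matters with several thousand files.

```shell
# Hypothetical helper: write column 1 to s_N.dat and column 2 to Volts_N.dat
# for each input file named like file_N.csv.
split_columns() {
    awk -F, '
        FNR == 1 {                         # a new input file starts
            if (s_out != "") { close(s_out); close(v_out) }
            id = FILENAME
            sub(/\.csv$/, "", id)          # file_1.csv -> file_1
            sub(/^.*_/, "", id)            # file_1     -> 1
            s_out = "s_" id ".dat"
            v_out = "Volts_" id ".dat"
        }
        FNR > 2 { print $1 > s_out; print $2 > v_out }   # skip the two header lines
    ' "$@"
}
```

It would be called as split_columns file_*.csv, with the .dat files written to the current directory.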
