performance issues in shell script - bash

I have a 200 MB tab separated text file with millions of rows. In this file, I have a column with multiple locations like US , UK , AU etc.
Now I want to break this file on the basis of this column. Though this code is working fine for me, but facing performance issue as it is taking more than 1 hour to split the file into multiple files based on locations. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we are reading the entire file only once, and depending on the locations, we are creating and appending the data to it.
Please suggest necessary improvements in the code to enhance its performance.
Following is a sample data and is separated by colon instead of tab. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD

There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you used something like this:
# save first line
NR == 1 {
header = $0
next
}
{
filename = "file_" $col ".txt"
# if country code has changed
if (filename != prev) {
# close the previous file
close(prev)
# if we haven't seen this file yet
if (!(filename in seen)) {
print header > filename
}
seen[filename]
}
# print whole line to file
print >> filename
prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(filename).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.

I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk will permit you to open different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk usig something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

Related

Is there a way to take an input that behaves like a file in bash?

I have a task where I'm given an input of the format:
4
A CS 22 M
B ECE 23 M
C CS 23 F
D CS 22 F
as the user input from the command line. From this, we have to perform tasks like determine the number of male and female students, determine which department has the most students, etc. I have done this using awk with the input as a file. Is there any way to do this with a user input instead of a file?
Example of a command I used for a file (where the content in the file is in the same format):
numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l) #list number of males
Not Reproducible
It works fine for me, so your problem can't be reproduced with either GNU or BSD awk under Bash 5.0.18(1). With your posted code and file sample:
$ numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l)
$ echo $numberofmales
2
Check to make sure you don't have problems in your input file, or elsewhere in your code.
Also, note that if you call awk without a file argument or input from a pipe, it tries to collect data from standard input. It may not actually be hanging; it's probably just waiting on end-of-file, which you can trigger with CTRL+D.
Recommended Improvements
Even if your code works, it can be improved. Consider the following, which skips the unnecessary field-separator definition and performs all the actions of your pipeline within awk.
males=$(
awk 'tolower($4)=="m" {count++}; END {print count}' file.txt
)
echo "$males"
Fewer moving parts are often easier to debug, and can often be more performant on large datasets. However, your mileage may vary.
User Input
If you want to use user input rather than a file, you can use standard input to collect your data, and then pass it as a quoted argument to a function. For example:
count_males () {
awk 'tolower($4)=="m" {count++}; END {print count}' <<< "$*"
}
echo "Enter data (CTRL-D when done):"
data=$(cat -)
# If at command prompt, wait until EOF above before
# pasting this line. Won't matter in scripts.
males=$(count_males "$data")
The result is now stored in males, and you can echo "$males" or make use of the variable in whatever other way you like.
Bash indeed does not care whether a file handle is connected to standard input or to a file, and neither does Awk.
However, if you want to pass the same input to multiple Awk instances, it really does make sense to store it in a temporary file.
A better overall solution is to write a better Awk script so you only need to read the input once.
awk 'NF > 1 { ++a[$4] } END { for (g in a) print g, a[g] }'
Demo: https://ideone.com/0ML7Xk
The NF > 1 condition is to skip the silly first line. Probably don't put that information there in the first place and let Awk figure out how many lines there are; it's probably better at counting than you are anyway.

Batch create files with name and content based on input file

I am a mac OS user trying to batch create a bunch of files. I have a text file with column of several hundred terms/subjects, eg:
hydrogen
oxygen
nitrogen
carbon
etcetera
I want to programmatically fill a directory with text files generated from this subject list. For example, "hydrogen.txt" and "oxygen.txt" and so on, with each file created by iterating through the lines of my list_of_names.txt file. Some lines are one word, but other lines are two or three words (eg: "carbon monoxide"). This I have figured out how to do:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
Additionally I need to create two lines of content within each of these files, and the content is both static and dynamic...
# filename
#elements/filename
...where in the example above the pound sign ("#") and "elements/" would be the same in all of the files created, but "filename" would be variable (eg: "hydrogen" for "hydrogen.txt" and "oxygen" for "oxygen.txt" etc). One further wrinkle is that if any spaces appear at all on the second line of content, there needs to be a trailing pound sign. For example:
# filename
#elements/carbon monoxide#
...although this last part is not a dealbreaker and I can use grep to modify list_of_names.txt such that phrases like "carbon monoxide" become "carbon_monoxide" and just deal with the repercussions of this later. (But if it is easy to preserve the spaces, I would prefer that.)
After a couple hours of searching and attempts to use sed, awk, and so on I am stuck at a directory full of files with the correct filename.txt format, but I can't get further that this. Mostly I think my efforts are failing because the solutions I can find for doing something like this are using commands I am not familiar with and they are structured for GNU and don't execute correctly in Terminal on Mac OS.
I am amenable to processing this in multiple steps (ie make all of the files.txt first, then run a second step to populate the content of the files), or as a single command that makes the files and all of their content simultaneously ('simultaneously' from a human timescale).
My horrible pseudocode (IN CAPS) for how this would look as 2 steps:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
awk 'NF>0' list_of_names.txt | while read line; OPEN "${line}.txt" AND PRINT "# ${line}\n#elements/${line}"; IF ${line} CONTAINS CHARACTER " " PRINT "#"; done
You could use a simple Bash loop and create the files in one shot:
#!/bin/bash
while read -r name; do # loop through input file content
[[ $name ]] || continue # skip empty lines
output=("# $name") # initialize the array with first element
trailing=
[[ $name = *" "* ]] && trailing="#" # name has spaces in it
output+=("#elements/$name$trailing") # name doesn't have a space
printf '%s\n' "${output[#]}" > "$name.txt" # write array content to the output file
done < list_of_names.txt
Doing it in awk:
awk '
NF {
trailing = (/ / ? "#" : "")
out=$0".txt"
printf("# %s\n#elements/%s%s\n", $0, $0, trailing) > out
close(out)
}
' list_of_names.txt
Doing the whole job in awk will yield better performance than in bash, which isn't really suited to processing text like this.
It seems to me that this should cover the requirements you've specified:
awk '
{
out=$0 ".txt"
printf "# %s\n#elements/%s%s\n", $0, $0, (/ / ? "#" : "") >> out
close(out)
}
' list_of_subjects.txt
Though you could shrink it to a one-liner:
awk '{printf "# %s\n# elements/%s%s\n",$0,$0,(/ /?"#":"")>($0".txt");close($0".txt")}' list_of_subjects.txt

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
! $7 in files {
files[$7]=sprintf("sport-%s.csv", $7)
print header > file
}
{
files[$7]=sprintf("sport-%s.csv", $7)
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\"", sport)
}
printf(" )\n");
}
The idea here is that we store sport names in the array files[], and build filenames out of that array. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. IF you feel lucky, you can eval this awk script inside command expansion, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that consiers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
if the files might have newlines in them, too, then things get tricky...
if your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values, you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
this then can be piped to your other process, here for example to wc
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.

getting the last opened file

input file:
wtf.txt|/Users/jaro/documents/inc/face/|
lol.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
lol.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
omg.txt|/Users/jaro/documents/inc/linked/|
input file is the list of opened files (opening file means 1 line of file) i want to get the last opened file in
e.g. : get last opened file in dir /Users/jaro/documents/inc/face/
output:
wtf.txt
This fetches the last line in the file whose second field is the desired folder name, and prints the first field.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { f=$1 }
END { print f }' file
To test whether the most recent file is also an existing file, I would use the shell to reverse the order with tac and perform the logic; skip the files in the wrong path, and the ones which don't exist, then print the first success and quit.
tac file |
while IFS='|' read -r basename path _; do
case $path in "/Users/jaro/documents/inc/face") ;; *) continue;; esac
test -e "$path/$basename" || continue
echo "$basename"
break
done |
grep .
The final grep . is to produce an exit code which reflects whether or not the command was successful -- if it printed a file, it's okay; if none of the extracted files existed, return error.
Below is my original answer, based on a plausible but apparently incorrect interpretation of your question.
Here is a quick attempt at finding the file with the newest modification time from the list. I avoid parsing ls, prefering instead to use properly machine-parseable output from stat. Since your input file is line-oriented, I assume no file names contain newlines, which simplifies things quite a bit.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -f '%m %N' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'
The first sort is to remove any duplicates, to avoid running stat more times than necessary (premature optimization, perhaps), the stat prefixes each line with the file's modification time expressed as seconds since the epoch, which facilitates easy numerical sorting by age, and the final Awk script neatly combines head -n 1 | rev | cut -d / -f1 | rev i.e. extract just the basename from the first line of output, then quit.
If there is any way to use a less wacky input format, that would be an improvement (probably of your life in general as well).
The output format from stat is not properly standardized, but your question is tagged linuxosx so I assume GNU coreutils BSD stat. If portability is desired, maybe look at find (which however may be overkill and/or not much better standardized across diverse platforms) or write a small Perl or Python script instead. (Well, Ruby too, I suppose, but personally, I'd go with Perl.)
perl -F'\|' -lane '{ $t{$F[0]} = (stat($F[1].$F[0]))[10]
if !defined $t{$F[0]} and $F[1] == "/Users/jaro/documents/inc/face/" }
END { print ((sort { $t{$a} <=> $t{$b} } keys %t)[-1]) }' file
atime – The atime (access time) is the time when the data of a file was last accessed. Displaying the contents of a file or executing a shell script will update a file’s atime, for example. You can view the atime with the ls -lu command
http://www.techtrunch.com/linux/ctime-mtime-atime-linux-timestamps
So in your case, will do the trick.
ls -lu /Users/jaro/documents/inc/face/

Grep outputs multiple lines, need while loop

I have a script which uses grep to find lines in a text file (ics calendar to be specific)
My script finds a date match, then goes up and down a few lines to copy the summary and start time of the appointment into a separate variable. The problem I have is that I'm going to have multiple appointments at the same time, and I need to run through the whole process for each result in grep.
Example:
LINE=`grep -F -n 20130304T232200 /path/to/calendar.ics | cut -f1 d:`
And it outputs only the lines, such as
86 89
Then it goes on to capture my other variables, as such:
SUMMARYLINE=$(( $LINE + 5 ))
SUMMARY:`sed -n "$SUMMARYLINE"p /path/to/calendar.ics
my script runs fine with one output, but it obviously won't work with more than 1 and I need for it to. should I send the grep results into an array? a separate text file to read from? I'm sure I'll need a while loop in here somehow. Need some help please.
You can call grep from a loop quite easily:
while IFS=':' read -r LINE notused # avoids the use of cut
do
# First field is now in $LINE
# Further processing
done < <(grep -F -n 20130304T232200 /path/to/calendar.ics)
However, if the file is not too large then it might be easier to read the whole file into an array and more around that.
With your proposed solution, you are reading through the file several times. Using awk, you can do it in one pass:
awk -F: -v time=20130304T232200 '
$1 == "SUMMARY" {summary = substr($0,9)}
/^DTSTART/ {start = $2}
/^END:VEVENT/ && start == time {print summary}
' calendar.ics

Resources