Grep outputs multiple lines, need while loop - bash

I have a script which uses grep to find lines in a text file (ics calendar to be specific)
My script finds a date match, then goes up and down a few lines to copy the summary and start time of the appointment into a separate variable. The problem I have is that I'm going to have multiple appointments at the same time, and I need to run through the whole process for each result in grep.
Example:
LINE=`grep -F -n 20130304T232200 /path/to/calendar.ics | cut -f1 d:`
And it outputs only the lines, such as
86 89
Then it goes on to capture my other variables, as such:
SUMMARYLINE=$(( $LINE + 5 ))
SUMMARY:`sed -n "$SUMMARYLINE"p /path/to/calendar.ics
my script runs fine with one output, but it obviously won't work with more than 1 and I need for it to. should I send the grep results into an array? a separate text file to read from? I'm sure I'll need a while loop in here somehow. Need some help please.

You can call grep from a loop quite easily:
while IFS=':' read -r LINE notused # avoids the use of cut
do
# First field is now in $LINE
# Further processing
done < <(grep -F -n 20130304T232200 /path/to/calendar.ics)
However, if the file is not too large then it might be easier to read the whole file into an array and more around that.

With your proposed solution, you are reading through the file several times. Using awk, you can do it in one pass:
awk -F: -v time=20130304T232200 '
$1 == "SUMMARY" {summary = substr($0,9)}
/^DTSTART/ {start = $2}
/^END:VEVENT/ && start == time {print summary}
' calendar.ics

Related

Is there a way to take an input that behaves like a file in bash?

I have a task where I'm given an input of the format:
4
A CS 22 M
B ECE 23 M
C CS 23 F
D CS 22 F
as the user input from the command line. From this, we have to perform tasks like determine the number of male and female students, determine which department has the most students, etc. I have done this using awk with the input as a file. Is there any way to do this with a user input instead of a file?
Example of a command I used for a file (where the content in the file is in the same format):
numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l) #list number of males
Not Reproducible
It works fine for me, so your problem can't be reproduced with either GNU or BSD awk under Bash 5.0.18(1). With your posted code and file sample:
$ numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l)
$ echo $numberofmales
2
Check to make sure you don't have problems in your input file, or elsewhere in your code.
Also, note that if you call awk without a file argument or input from a pipe, it tries to collect data from standard input. It may not actually be hanging; it's probably just waiting on end-of-file, which you can trigger with CTRL+D.
Recommended Improvements
Even if your code works, it can be improved. Consider the following, which skips the unnecessary field-separator definition and performs all the actions of your pipeline within awk.
males=$(
awk 'tolower($4)=="m" {count++}; END {print count}' file.txt
)
echo "$males"
Fewer moving parts are often easier to debug, and can often be more performant on large datasets. However, your mileage may vary.
User Input
If you want to use user input rather than a file, you can use standard input to collect your data, and then pass it as a quoted argument to a function. For example:
count_males () {
awk 'tolower($4)=="m" {count++}; END {print count}' <<< "$*"
}
echo "Enter data (CTRL-D when done):"
data=$(cat -)
# If at command prompt, wait until EOF above before
# pasting this line. Won't matter in scripts.
males=$(count_males "$data")
The result is now stored in males, and you can echo "$males" or make use of the variable in whatever other way you like.
Bash indeed does not care whether a file handle is connected to standard input or to a file, and neither does Awk.
However, if you want to pass the same input to multiple Awk instances, it really does make sense to store it in a temporary file.
A better overall solution is to write a better Awk script so you only need to read the input once.
awk 'NF > 1 { ++a[$4] } END { for (g in a) print g, a[g] }'
Demo: https://ideone.com/0ML7Xk
The NF > 1 condition is to skip the silly first line. Probably don't put that information there in the first place and let Awk figure out how many lines there are; it's probably better at counting than you are anyway.

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable includes all lines of your list.txt file. Including the newline character \n which is invisible in many UIs and programs, but it is of course visible when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string stored from the pos.txt files including newlines. You cannot pass not more than one integer value for the -c flag.
Your attempts can be fixed quite easily with removing calls to cat and using a temporary variable to hold the file content
while IFS= read -r lineNo; do
tail -c "$lineNo" ref.txt | head -c1
done < pos.txt
But then if your intentions is print the desired output in a new-line every time, head does not output that way. It just forms a string atga for your given input in a single line and not across multiple lines with one character at each line.
As Gordon mentions in one of the comments, for much more efficient FASTA files processing, you could just use one invocation of awk though (skipping multiple forks to head/tail). Your provided input does not involve any headers to skip which would be straightforward as
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
both yield:
a
g
t
a

performance issues in shell script

I have a 200 MB tab separated text file with millions of rows. In this file, I have a column with multiple locations like US , UK , AU etc.
Now I want to break this file on the basis of this column. Though this code is working fine for me, but facing performance issue as it is taking more than 1 hour to split the file into multiple files based on locations. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we are reading the entire file only once, and depending on the locations, we are creating and appending the data to it.
Please suggest necessary improvements in the code to enhance its performance.
Following is a sample data and is separated by colon instead of tab. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you used something like this:
# save first line
NR == 1 {
header = $0
next
}
{
filename = "file_" $col ".txt"
# if country code has changed
if (filename != prev) {
# close the previous file
close(prev)
# if we haven't seen this file yet
if (!(filename in seen)) {
print header > filename
}
seen[filename]
}
# print whole line to file
print >> filename
prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(filename).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk will permit you to open different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk usig something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

fast alternative to grep file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f ,sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because its compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost with doing the whole pipe with gawk or another awk that has asorti by doing:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern that contains the shell variable t
$0 ~ pattern matches the current line against the pattern, if it matches we'll do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following function after all the input has been processed
asorti(matches,inds) sort the entries of matches such that the inds is an array holding the order of the keys in matches to get the values in sorted order
print inds[1] prints the index in matches (i.e., a $5 from before) associated with the lowest 19th field
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each, not have to read the file from disk each time and not need so many extra processes for all the pipes.
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
Since you haven't provided sample input/output this is just a guess and I only post it because there's other answers already posted that you should not do, so - this may be what you want instead of that one line:
result=$(awk -v t="$t" '
BEGIN { regexp = "entry t " t " " }
$0 ~ regexp {
if ( ($6 > maxKey) || (maxKey == "") ) {
maxKey = $6
maxVal = $5
}
}
END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

Different output for pipe in script vs. command line

I have a directory with files that I want to process one by one and for which each output looks like this:
==== S=721 I=47 D=654 N=2964 WER=47.976% (1422)
Then I want to calculate the average percentage (column 6) by piping the output to AWK. I would prefer to do this all in one script and wrote the following code:
for f in $dir; do
echo -ne "$f "
process $f
done | awk '{print $7}' | awk -F "=" '{sum+=$2}END{print sum/NR}'
When I run this several times, I often get different results although in my view nothing really changes. The result is almost always incorrect though.
However, if I only put the for loop in the script and pipe to AWK on the command line, the result is always the same and correct.
What is the difference and how can I change my script to achieve the correct result?
Guessing a little about what you're trying to do, and without more details it's hard to say what exactly is going wrong.
for f in $dir; do
unset TEMPVAR
echo -ne "$f "
TEMPVAR=$(process $f | awk '{print $7}')
ARRAY+=($TEMPVAR)
done
I would append all your values to an array inside your for loop. Now all your percentages are in $ARRAY. It should be easy to calculate the average value, using whatever tool you like.
This will also help you troubleshoot. If you get too few elements in the array ${#ARRAY[#]} then you will know where your loop is terminating early.
# To get the percentage of all files
Percs=$(sed -r 's/.*WER=([[:digit:].]*).*/\1/' *)
# The divisor
Lines=$(wc -l <<< "$Percs")
# To change new lines into spaces
P=$(echo $Percs)
# Execute one time without the bc. It's easier to understand
echo "scale=3; (${P// /+})/$Lines" | bc

Resources