line traversal in awk - shell

I am traversing a file in awk. An example input is:
Dat time range column session - 1
time name place session animal - 2
hi bye name things - 3
I need to traverse the file line by line and, within each line that contains session, traverse word by word.
Thus in this case I need to reach lines 1 and 2, as they contain the word session, and I can skip line 3, as it doesn't contain that field. From there I need to traverse word by word to reach the session field.
I know $0 can represent the whole line. But my question is how to traverse word by word after reaching the line.
Could you please help me regarding this. Thank you.

You can loop through the current line $0 with this construct:
for(i = 1; i <= NF; i++) print $i
This makes use of the predefined awk variable NF, which stands for the number of fields on the current line ($0).
You can examine the value of $i as it iterates through the line and, based on that, decide what to do with the value (print it, skip it, etc.), e.g. if ($i == "session") ...
Update:
You can also use the match() function to determine if the current line you are processing contains the "session" string without iterating through the line. E.g.,
where = match($0, "session")
if (where > 0)
print "Found session in this line";
else
print "session not found in this line";
Note that match() takes a regular expression as the 2nd parameter, so your matches can be quite sophisticated. See this page for more information about this function and other awk string functions.
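As a small illustration of the regular-expression matching mentioned above, here is a minimal sketch that uses match() together with the standard RSTART and RLENGTH variables it sets. The helper name locate, the regex, and the sample lines are assumptions made only for this demo.

```shell
# locate: hypothetical helper; for each input line it prints the start
# position, the length, and the text of the first regex match.
locate() {
  awk '{
    if (match($0, /sess?ion/))            # match() sets RSTART and RLENGTH
      print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    else
      print "no match"
  }'
}

printf '%s\n' 'time name place session animal' 'hi bye name things' | locate
```

For the first sample line this should report that the match starts at character 17 and is 7 characters long; the second line has no match.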

You can use a for loop, filtering only on the lines that contain "session":
awk '/session/{ for (i = 1; i <= NF; i++) { \
if ($i == "session") \
do_whatever_here \
} \
}'
You can read more on these instructions here: for, string comparison and if.
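Putting the pieces together, a minimal sketch might look like the following. It assumes the sample input from the question (without the trailing "- N" markers), and the function name find_session is made up for illustration.

```shell
# Sample input from the question, without the trailing "- N" line markers.
input='Dat time range column session
time name place session animal
hi bye name things'

# find_session: hypothetical helper; only lines matching /session/ are
# visited, and each field of such a line is compared against the word.
find_session() {
  awk '/session/ {
    for (i = 1; i <= NF; i++)
      if ($i == "session")
        printf "line %d: field %d\n", NR, i
  }'
}

printf '%s\n' "$input" | find_session
```

With this input the third line is skipped entirely, and the word is reported as field 5 of line 1 and field 4 of line 2.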

Related

Awk substring doesnt yield expected result

I've a file whose content is below:
C2:0301,353458082243570,353458082243580,0;
C2:0301,353458082462440,353458082462450,0;
C2:0301,353458082069130,353458082069140,0;
C2:0301,353458082246230,353458082246240,0;
C2:0301,353458082559320,353458082559330,0;
C2:0301,353458080153530,353458080153540,0;
C2:0301,353458082462670,353458082462680,0;
C2:0301,353458081943950,353458081943960,0;
C2:0301,353458081719070,353458081719080,0;
C2:0301,353458081392470,353458081392490,0;
Field 2 and Field 3 (considering , as separator), contains 15 digit IMEI number ranges and not individual IMEI numbers. Usual format of IMEI is 8-digits(TAC)+6-digits(Serial number)+0(padded). The 6 digits(Serial number) part in the IMEI defines the start and end range, everything else remaining same. So in order to find individual IMEIs in the ranges (which is exactly what I want), I need a unary increment loop from 6 digits(Serial number) from the starting IMEI number in Field-2 till 6 digits(Serial number) from the ending IMEI number in Field-3. I am using the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
It gives me the below result:
353458082243570,0
353458082243580,0
353458082462440,0
353458082462450,0
353458082069130,0
353458082069140,0
353458082246230,0
353458082246240,0
353458082559320,0
353458082559330,0
353458080153530,0
353458082462670,0
353458082462680,0
353458081943950,0
353458081943960,0
353458081719070,0
353458081719080,0
353458081392470,0
353458081392480,0
353458081392490,0
The above is as expected except for the below line in the result:
353458080153530,0
The result is actually from the below line in the input file:
C2:0301,353458080153530,353458080153540,0;
But the expected output for the above line in input file is:
353458080153530,0
353458080153540,0
I need to know whats going wrong in my script.
The problem with your script is that you start with two string variables, v and t (typed as strings because they are the result of a string operation, substr()), and then convert one of them to a number with v++, which strips the leading zeros. The comparison v <= t is then a string comparison, because comparing a string (t) with a number, a string, or a numeric string is always a string comparison. Yes, you can add zero to each of the variables to force a numeric comparison, but IMHO this is more like what you're really trying to do:
$ cat tst.awk
BEGIN { FS=","; re="(.{8})(.{6})(.*)" }
{
match($2,re,beg)
match($3,re,end)
for (i=beg[2]; i<=end[2]; i++) {
printf "%s%06d%s\n", end[1], i, end[3]
}
}
$ gawk -f tst.awk file
353458082243570
353458082243580
353458082462440
353458082462450
353458082069130
353458082069140
353458082246230
353458082246240
353458082559320
353458082559330
353458080153530
353458080153540
353458082462670
353458082462680
353458081943950
353458081943960
353458081719070
353458081719080
353458081392470
353458081392480
353458081392490
and when done with appropriate variables like that no conversion is necessary. Note also that with the above you don't need to repeatedly state the same or relative numbers to extract the part of the strings you care about, you just state the number of characters to skip (8) and the number to select (6) once. The above uses GNU awk for the 3rd arg to match().
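To see the string-versus-number comparison described in this answer in isolation, the following sketch reproduces it with the failing input values. The function name compare_demo is an assumption; the two IMEI strings are taken from the question.

```shell
# compare_demo: hypothetical helper that replays the failing comparison.
compare_demo() {
  awk 'BEGIN {
    v = substr("353458080153530", 9, 6)   # "015353", a string
    t = substr("353458080153540", 9, 6)   # "015354", a string
    v++                                   # v is now the number 15354
    # number vs. string: awk compares "15354" with "015354" as strings
    print "string:  " ((v <= t) ? "true" : "false")
    # adding 0 to both sides forces a numeric comparison
    print "numeric: " ((v + 0 <= t + 0) ? "true" : "false")
    # %06d restores the leading zero when printing
    printf "padded:  %06d\n", v
  }'
}

compare_demo
```

The string comparison fails because "1" sorts after "0", which is why the original loop stopped after one iteration for this range.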
The problem was in the while(v <= t) part of the script. I believe that with leading 0s the comparison was not happening properly, so I made sure both values are converted to numbers for the comparison in the while loop. The AWK documentation says you can convert a value to a number by using value+0. So while(v <= t) in the awk script needed to change to while(v+0 <= t+0). So the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
was changed to :
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v+0 <= t+0) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
That single change gave me the expected value for the failure case. For example, this line in my input file:
C2:0301,353458080153530,353458080153540,0;
Now gives me individual IMEIs as :
353458080153530,0
353458080153540,0
Use an if statement that checks for leading zeros in variable v setting y accordingly:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) { if (substr(v,1,1)=="0") { v++;y="0"v } else { v++;y=v } ;printf "%s%0"6"s%s,%s\n", substr($3,1,8),y,substr($3,15,2),$4;v=y } }' TEMP.OUT.merge_range_part1_21
Make sure that the body of the while loop is enclosed in braces and that v is incremented within both branches of the if statement.
Set v=y at the end of the statement to allow this to work on additional increments.

Deleting lines with more than 30% lowercase letters

I am trying to process some data, but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many lines more....
I want to filter out all the sequences, together with their headers (a header starts with >), whose sequence string (the lines not starting with >) contains 30 percent or more lowercase letters. The sequence strings can span multiple lines.
So after command xy the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.
Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
total_cnt = 0
lower_cnt = 0
for (i=2; i<=NF; ++i) {
total_cnt += length($i)
s = $i
lower_cnt += gsub(/[a-z]/, "", s)
}
ratio = lower_cnt / total_cnt
if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars
awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
H=$0;B="";next}
{B=( (B != "") ? B "\n" : "" ) $0}
END{ b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
}' YourFile
Quick and dirty; the repeated check-and-print logic would be better placed in a function.
Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict; # Force variable declaration.
use warnings; # Warn about dangerous language use.
sub filter # Declare a subroutine, a function called `filter`.
{
my ($header, $body) = @_; # Give the first two function arguments the names header and body.
my $lower = $body =~ tr/a-z//; # Count the translation of the characters a-z to nothing.
print $header, $body, "\n" # Print header, body and newline,
unless $lower / length ($body) > 0.3; # unless lower characters have more than 30%.
}
my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or a file given in the command line.
if (/^>/) { # If the line starts with >,
filter ($header, $body) # call filter with header and body,
if defined $header; # if header is defined, which is not the case at the beginning of the file.
($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
} else {
chomp; # Remove the newline at the end of the line.
$body .= $_; # Append the line to body.
}
}
filter ($header, $body); # Filter the last record.

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies that the first 4 numbers/letters of the line correspond to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., there are two 1435* stations). I need to find a way to search through my text files and identify whether there are any duplicates of the partial strings that represent the stations, so that I can delete one or both of the duplicates. If I could have a Bash script output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
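As one possible way to do that extra work, here is a sketch that reports the line number and content of every line whose 4-character prefix occurs more than once. The function name report_dups is hypothetical, and the sample lines are copied from the question with their literal "..." suffixes.

```shell
# report_dups: hypothetical helper; reads the named files (or stdin) and
# prints "line number: line" for every line sharing a 4-character prefix.
report_dups() {
  awk '{
    key = substr($0, 1, 4)
    count[key]++
    # remember each line, tagged with its line number, under its prefix
    lines[key] = lines[key] sprintf("%d: %s\n", NR, $0)
  }
  END {
    for (key in count)
      if (count[key] > 1)
        printf "%s", lines[key]
  }' "$@"
}

printf '%s\n' '1002IPU3...' 'POIPIPU2...' '1435IPU1...' \
              '1812IPU3...' 'BFTOIPD3...' '1435IPD2...' | report_dups
```

Note that the order in which different prefixes are reported is unspecified, because awk's for (key in count) does not guarantee an iteration order; within one prefix the lines stay in file order.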
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout, but this should work:
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line and treats the first 4 characters as the device name. It builds a dictionary, device, whose keys are the device names and whose values are the line numbers where each name occurs.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

How can I read a CSV file if only non-empty fields are wrapped by double quotes?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1{
# doing some logic with header
}
NR >= 2{
# doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
With this data I get wrong results because the empty fields between the commas are ignored, which gives me the wrong position numbers for the extracted data.
For example, it reports that "7865431234" is at the 3rd position, whereas it is at the 6th.
Can anyone suggest the changes to get the correct position of fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$@"
The FPAT says that a field is:
- a sequence of non-commas,
- or it is a field started with a double quote, containing zero or more instances of either:
  - a non-quote, or
  - two double quotes
  followed by a double quote and optional non-comma data
- or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new;
open ( my $input, '<', 'flarg.csv' ) or die $!;
while ( my $row = $csv -> getline ( $input ) ) {
if ( $. == 1 ) {
# do first row stuff;
print "Header: ", join ",", @$row,"\n";
}
else {
print join "\n", @$row;
}
}
Or simpler yet - use Text::ParseWords which is core.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
while ( my $line = <DATA> ) {
my @fields = parse_line(',', 1, $line);
print join "\n", @fields;
}
__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

I need to be able to search for a string (let's use 4320101), print 20 lines above the string, and keep printing after it until it finds the string </eventUpdate>.
For example:
Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line
I just want the following result outputted to a file:
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
There are multiple examples of these groups of text in a file that I want.
I tried using this below:
cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt
But it only ever shows for 20 lines either side of the string.
Notes:
- The line number the text is on is inconsistent, so I cannot rely on line numbers; hence why I am using -A 20.
- Ideally, when it searches after the string, it stops when it finds </eventUpdate> and then carries on searching.
Summary: find 4320101, output 20 lines above 4320101 (or one line of white space), and then output all lines below 4320101 up to </eventUpdate>.
Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.
This might work for you (GNU sed):
sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320101/!D;:c;n;/<\/eventUpdate>/!bc' file
EDIT:
:a;s/\n/&/20;tb;$!{N;ba}; this keeps a window of 20 lines in the pattern space (PS).
:b;/4320101/!D; this moves the above window through the file until the pattern 4320101 is found.
:c;n;/<\/eventUpdate>/!bc the 20 line window is printed, followed by any subsequent lines until the pattern <\/eventUpdate> is found.
Here is an ugly awk solution :)
awk 'BEGIN{last=1}
{if((length($0)==0) || ($0 ~ /Random/))last=NR}
/4320101/{flag=1;
if((NR-last)>20) last=NR-20;
cmd="sed -n \""last+1","NR-1"p \" input.txt";
system(cmd);
}
flag==1{print}
/eventUpdate/{flag=0}' <filename>
So basically it keeps track of the last blank line, or the last line containing the Random pattern, in the last variable. When 4320101 is found, it prints from that line minus 20, or from last, whichever is nearer, through a system sed command, and sets the flag. The flag causes the following lines to be printed until eventUpdate is found. I have not tested it, but it should work.
Look-behind in sed/awk is always tricky. This self-contained awk script keeps the last 20 lines stored; when it gets to 4320101 it prints the stored lines that follow the most recent blank or undesired line. At that point it switches into printall mode and prints all lines until eventUpdate is encountered, then it prints that and quits.
awk '
function store( line ) {
for( i=1; i <= 20; i++ ) {
last[i-1] = last[i];
};
last[20]=line;
};
function purge() {
for( i=20; i >= 0; i-- ) {
if( length(last[i])==0 || last[i] ~ "Random" ) {
stop=i;
break
};
};
for( i=(stop+1); i <= 20; i++ ) {
print last[i];
};
};
{
store($0);
if( /4320101/ ) {
purge();
printall=1;
next;
};
if( printall == 1) {
print;
if( /eventUpdate/ ) {
exit 0;
};
};
}' test
Let's see if I understand your requirements:
You have two strings, which I'll call KEY and LIMIT. And you want to print:
1. At most 20 lines before a line containing KEY, but stopping if there is a blank line.
2. All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)
The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit key. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.
So let's do it in awk:
#file: extract.awk
# Initialize the circular buffer
BEGIN { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0 { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) { for (i = count < 20 ? 0 : count - 20; i < count; ++i)
print buf[i % 20];
count = 0;
}
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT) { print; next; }
# Otherwise, save the line
{ buf[count++ % 20] = $0; }
In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:
awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME
Notes:
I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and there is nowhere in the requirements that regexen are even desired. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
You can try something like this -
awk '{
a[NR] = $0
}
/<\/eventUpdate>/ {
x = NR
}
END {
for (i in a) {
if (a[i]~/4320101/) {
for (j=i-20;j<=x;j++) {
print a[j]
}
}
}
}' file
The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:
awk '
NR==FNR {
if ($0 ~ /\<4320101\>/) {
for (i=NR-20;i<NR;i++)
range[i]
inRange = 1
}
if (inRange) {
range[NR]
}
if ($0 ~ /<\/eventUpdate>/) {
inRange = 0
}
next
}
FNR in range
' file file
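The two-pass idiom used here (NR==FNR is true only while the first copy of the file is being read) can be tried on a tiny input. The sketch below is a reduced version that remembers just one line before each match; the helper name print_before_key, the literal KEY pattern, and the temporary file are assumptions for the demo.

```shell
# print_before_key: hypothetical helper; prints the line just before each
# line containing KEY by reading the same file twice.
print_before_key() {
  awk '
    NR == FNR { if (/KEY/) want[FNR - 1]; next }   # pass 1: remember line numbers
    FNR in want                                    # pass 2: print remembered lines
  ' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' a b KEY c > "$tmp"
print_before_key "$tmp"
rm -f "$tmp"
```

Because the file is named twice on the command line, this approach needs a real file rather than a pipe, which is the main trade-off compared with the buffering solutions above.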
