replace terms with associated abbreviations from another file, in case of matching - macos

I have two files:
1. Pattern file = pattern.txt
2. File containing different terms = terms.txt
pattern.txt contains two columns, separated by ;
In the first column I have several terms, and in the second column the abbreviations
associated with the first column, on the same line.
terms.txt contains single words, and also terms defined by a combination of words.
pattern.txt
Berlin;Brln
Barcelona;Barcln
Checkpoint Charly;ChckpntChrl
Friedrichstrasse;Fridrchstr
Hall of Barcelona;HllOfBarcln
Paris;Prs
Yesterday;Ystrdy
terms.txt
Berlin
The Berlinale ended yesterday
Checkpoint Charly is still in Friedrichstrasse
There will be a fiesta in the Hall of Barcelona
Paris is a very nice city
The target is to replace terms with standardised abbreviations and to find out which terms
have no abbreviation.
As a result I would like to have two files.
The first file is a new terms file, with terms replaced by abbreviations wherever possible.
The second file contains a list of all terms that don't have an abbreviation.
The matching is case-insensitive; I don't distinguish between "The" and "the".
new_terms.txt
Brln
The Berlinale ended Ystrdy
ChckpntChrl is still in Fridrchstr
There will be a fiesta in the HllOfBarcln
Prs is a very nice city
terms_without_abbreviations.txt
a
be
Berlinale
city
ended
fiesta
in
is
nice
of
still
The
There
very
will
I would appreciate your help, and thanks in advance for your time and hints!

This is mostly what you need (note it works word by word, so multi-word terms such as "Hall of Barcelona" are not handled; more on that below):
BEGIN { FS = ";" }

# First file (pattern.txt): map lowercased terms to their abbreviations.
FNR == NR { dict[tolower($1)] = $2; next }

# Second file (terms.txt): rebuild each line word by word.
{
    line = ""
    count = split($0, words, / +/)
    for (i = 1; i <= count; i++) {
        key = tolower(words[i])
        if (key in dict)
            words[i] = dict[key]            # known term: substitute the abbreviation
        else if (!(key in result))
            result[key] = words[i]          # unknown term: remember its first spelling
        line = line " " words[i]
    }
    print substr(line, 2) > "new_terms.txt"
}

END {
    # asorti() is GNU awk: sort the unknown terms alphabetically.
    count = asorti(result, sorted)
    for (i = 1; i <= count; i++)
        print result[sorted[i]] > "terms_without_abbreviations.txt"
}
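Assuming the script is saved as, say, replace.awk (the name is mine), run it with the pattern file first:
awk -f replace.awk pattern.txt terms.txt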

Ok, so I had a bit of a crack, but will explain the issues:
If multiple patterns in pattern.txt can apply to a single line, the first one will make its change and the second will not (e.g. Barcelona;Barcln and Hall of Barcelona;HllOfBarcln: once Barcelona has already been turned into Barcln, the longer term "Hall of Barcelona" no longer exists in the line, so no change is made).
Similar to the above, there is no abbreviation for the word 'Hall', so if we assume the above happened and only the first change was made, your file of terms without abbreviations will include hall.
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
    IGNORECASE = 1                      # GNU awk only
}
# First file: build the term -> abbreviation lookup.
FNR == NR {
    abbr[tolower($1)] = $2
    next
}
{
    # Split on whitespace ourselves: FS is still ";" here, so for terms.txt
    # the whole line would otherwise land in $1.
    n = split($0, w, /[[:space:]]+/)
    for (i = 1; i <= n; i++) {
        item = tolower(w[i])
        if (!(item in abbr) && !(item in twa)) {
            twa[item]                   # mark as already reported
            print item > "terms_without_abbreviations.txt"
        }
    }
    for (i in abbr)
        gsub("\\<" i "\\>", abbr[i])    # \< and \> are GNU awk word boundaries
    print > "new_terms.txt"
}
There are probably other gotchas to look out for, but it is a rough direction. Not sure how you would get around my points above? One idea is sketched below.
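One way around the overlap problem is to apply the longest patterns first, so that Hall of Barcelona wins over Barcelona. A minimal sketch of the replacement half only (it leaves out the unknown-terms report; PROCINFO["sorted_in"] needs GNU awk 4.0+):
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
    IGNORECASE = 1
}
FNR == NR {
    abbr[tolower($1)] = $2
    len[tolower($1)] = length($1)   # remember each pattern's length
    next
}
{
    # iterate len[] by value, numerically descending: longest pattern first
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in len)
        gsub("\\<" i "\\>", abbr[i])
    print > "new_terms.txt"
}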

awk: Interpreting strings as mathematical expressions

Context: I have an input file that contains parameters with associated values followed by literal mathematical expressions such as:
PARAMETERS DEFINITION
A = 5; B = 2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
...
and I would like to get the strings of the second part to be interpreted as mathematical expressions so that I get the results of the expressions in my output file:
...
MATHEMATICAL EXPRESSIONS
10
0.2
...
What I did already: So far, using awk, I store all the parameters names and their corresponding values in two distinct arrays. I then replace each parameter with its value so that I am now in a similar situation as the author of this thread.
However, the answers s/he gets are not in awk, except for the last one, which is very specific to his/her situation and, for me as a beginner with awk and shell scripting, hard to understand.
What I tried afterwards: As I have no clue how to do this in awk, the idea I had was to store the new field value in a variable, then use a shell command within the awk script like this:
#!/bin/awk -f
BEGIN{}
{
myExpression=$1
system("echo $myExpression | bc")
}
END{}
This, unfortunately, does not work, as the awk variable is not recognized by the echo command.
What I would like:
I would prefer a solution using awk alone with no call to external functions; however, I am not against one using a shell command if it is simpler.
EDIT: Taking into account all the comments so far, I will be more precise; my input files look more like this:
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
The names of the parameters are complex enough that any match of the string "[param#]" within the document corresponds to a parameter that I want replaced by its value.
The way I manage to store the parameters and their values in arrays is the following:
{
    if (match($0, /PARAMETERS_DEFINITION/) != 0) { paramSwitch = 1 }
    if (match($0, /MATHEMATICAL_EXPRESSIONS/) != 0) { paramSwitch = 0 }
    if (paramSwitch == 1)
    {
        parameterName[numOfParam] = $1 ;
        parameterVal[numOfParam] = $3 ;
        numOfParam += 1
    }
}
Instead of this:
{
myExpression=$1
system("echo $myExpression | bc")
}
I think you'd want this:
{
myExpression=$1
system("echo " myExpression " | bc")
}
That's because in awk, assignments do not end up as environment variables, and putting strings next to each other concatenates them.
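As a side note, it may be safer to single-quote the expression so the shell does not expand characters like *; a sketch, assuming the expressions never contain single quotes:
{
    myExpression = $1
    # single quotes keep the shell from globbing '*'
    system("echo '" myExpression "' | bc -l")
}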
You are asking for awk: Interpreting strings as mathematical expressions - this functionality is usually called eval, and no, (AFAIK) awk has no such function. Your question is therefore a typical XY problem.
The right tool for this is bc, where you (nearly) don't need to modify anything: simply feed bc your input, only ensuring that the variables are lowercase, as in the following input (your example, edited):
#PARAMETERS DEFINITION
a=5; b=2; c=1.5; d=7.5
#MATHEMATICAL EXPRESSIONS
a*b
c/d
and running it like
bc -l < inputfile
produces
10
.20000000000000000000
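If the 20-digit fraction bothers you, you can set bc's scale yourself instead of using -l (a sketch; note that bc prints .2 rather than 0.2):
$ printf 'scale=1\na=5; b=2; c=1.5; d=7.5\na*b\nc/d\n' | bc
10
.2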
EDIT: For your edit, with the new input data, the following
grep '\[' inputfile | sed 's/[][]//g' | bc -l
for the input
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
produces the following output:
10
.20000000000000000000
i.e. grepping out only the lines that contain [ (any param definition or expression) and removing any [ or ], creating the following bc program:
param1 = 5
param2 = 2
param3 = 1.5
param4 = 7.5
param1*param2
param3/param4
and send the whole "program" to bc...
Using BIDMAS as a basis I have created this mathematical function in awk.
I have not included brackets (or indices) yet, as they will require some extra effort, but I may add them later.
This awk script effectively works as bc does.
No system call required, all in awk.
Generic version for all applications (the split counts are kept so that sub-expressions are combined in order):
awk '{na=split($0,a,"+")
for(i=1;i<=na;i++){
    ns=split(a[i],s,"-")
    for(j=1;j<=ns;j++){
        nm=split(s[j],m,"*")
        for(k=1;k<=nm;k++){
            nd=split(m[k],d,"/")
            for(l=2;l<=nd;l++)
                d[1]=d[1]/d[l]
            m[k]=d[1]
            if(k>1)m[1]=m[1]*m[k]
        }
        s[j]=m[1]
        if(j>1)s[1]=s[1]-s[j]
    }
    a[i]=s[1]
}
b=0
for(i=1;i<=na;i++)b=b+a[i]
print b}' file
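For instance, if the program between the quotes is saved as bidmas.awk (a hypothetical file name), it follows the usual precedence:
$ echo '5*2+7.5/1.5' | awk -f bidmas.awk
15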
For your specific example
awk '
/MATHEMATICAL_EXPRESSIONS/{z=1}
NR>1&&!z{
    split($0,y," = ")
    gsub(/[][]/,"\\\\&",y[1])   # escape [ and ] so the name works as a literal regex
    x[y[1]]=y[2]
}
z&&/\[/{                        # only expression lines, i.e. those containing a [param]
    for(n in x)gsub(n,x[n])
    na=split($0,a,"+")
    for(i=1;i<=na;i++){
        ns=split(a[i],s,"-")
        for(j=1;j<=ns;j++){
            nm=split(s[j],m,"*")
            for(k=1;k<=nm;k++){
                nd=split(m[k],d,"/")
                for(l=2;l<=nd;l++)
                    d[1]=d[1]/d[l]
                m[k]=d[1]
                if(k>1)m[1]=m[1]*m[k]
            }
            s[j]=m[1]
            if(j>1)s[1]=s[1]-s[j]
        }
        a[i]=s[1]
    }
    b=0
    for(i=1;i<=na;i++)b=b+a[i]
    print b}' file
There is something like an eval in awk for numbers: it is the magic string-to-number conversion that happens when the context requires it; adding +0 to a string forces that conversion.
What I got for you (detailed version below), with a file named awkinput holding your example input:
awk '/[A-Z]=[0-9.]+;?/ { for (i=1;i<=NF ;i++) { print "working on "$i; split($i,fields,"="); sub(/;/,"",fields[2]); params[fields[1]]=strtonum(fields[2]) } }; /[A-Z][*\/+-][A-Z]/ { for (p in params) { sub(p, params[p],$0); }; system("echo " $0 " | bc -ql") }' awkinput
Detailed:
/[A-Z]=[0-9.]+;?/ { # if we match something like A=4.2, with or without a ; at the end
    for (i=1; i<=NF; i++) { # loop through the fields (separated by space, awk's default field separator)
        print "working on " $i; # inform on what we do
        split($i, fields, "="); # split into an array to get param and value
        sub(/;/, "", fields[2]); # remove the ; at the end, if any
        params[fields[1]] = strtonum(fields[2]) # new array of parameters where the values are numeric
    }
}
/[A-Z][*\/+-][A-Z]/ { # when the line matches a math operation with one param on each side (at least)
    for (p in params) { # loop over known params
        sub(p, params[p], $0); # replace each param with its value
    };
    system("echo " $0 " | bc -ql") # print the result (no way to get rid of the system call here)
}
Drawback:
A formula of the form AB*C would be resolved to 52*1.5 (A and B get replaced individually).
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
$ awk -vRS='[= ;\n]' '{if ($0 ~ /[0-9]/){a[x] = $0; print x"="a[x]}else{x=$0}}/MATHEMATICAL/{print "MATHEMATICAL EXPRESSIONS"}{if (index($0,"*")) print a[substr($0,1,1)] * a[substr($0,3,1)]}{if (index($0,"/")) print a[substr($0,1,1)] / a[substr($0,3,1)]}' test
A=5
B=2
C=1.5
D=7.5
MATHEMATICAL EXPRESSIONS
10
0.2
Formatted nicely:
$ cat test.awk
# Store all variables in an array
{
    if ($0 ~ /[0-9]/) {
        a[x] = $0
        print x " = " a[x] # Print the keys & values
    }
    else {
        x = $0
    }
}
# Print header
/MATHEMATICAL/ { print "MATHEMATICAL EXPRESSIONS" }
# Do the maths; index() treats the operator as a plain string, so * and +
# can't be mistaken for regex operators (a case statement can work too,
# but it's not as widely available)
{
    if (index($0, "*"))
        print a[substr($0,1,1)] * a[substr($0,3,1)]
}
{
    if (index($0, "/"))
        print a[substr($0,1,1)] / a[substr($0,3,1)]
}
{
    if (index($0, "+"))
        print a[substr($0,1,1)] + a[substr($0,3,1)]
}
{
    if (index($0, "-"))
        print a[substr($0,1,1)] - a[substr($0,3,1)]
}
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
D+C
C-A
$ awk -f test.awk -vRS='[= ;\n]' test
A = 5
B = 2
C = 1.5
D = 7.5
MATHEMATICAL EXPRESSIONS
10
0.2
9
-3.5

How to merge rows from the same column using unix tools

I have a text file that looks like the following:
1000000    45    M    This is a line        This is another line  Another line
                      that breaks into      that also breaks      that has a blank
                      multiple rows         into multiple rows -  row below.
                                            How annoying!

1000001    50    F    I am another          I am well behaved.
                      column that has
                      text spanning
                      multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS = "11 6 5 22 22" }
length($1) == 11 {
    if ($1 ~ /[^[:blank:]]/) {
        if (f1) print_line()
        f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
    }
    else {
        f4 = f4" "$4; f5 = f5" "$5
    }
}
function rtrim(str) {
    sub(/[[:blank:]]+$/, "", str)
    return str
}
function print_line() {
    gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
    gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
    printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3), f4, f5
}
END { if (f1) print_line() }
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can drop the map block in both places, leaving just the grep { $_ ne "" }:
use strict;
use warnings;

chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;

while (<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
        @cols = @tmp;
    }
    else {
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
You can write a script in Python that does that. Read each line and call split on it; if the line is not blank, merge it into the previous row, and if it is blank, start a new row for the next line. Finally use the csv writer to write the result set to file.
Something along the lines of:
import csv

results = []
isNewItem = True
with open(filename) as inputFile:            # filename as in the question
    for line in inputFile:
        temp = line.split()
        if not temp:                         # blank line: the next row starts a new item
            isNewItem = True
            continue
        if isNewItem or not results:
            results.append(temp)
            isNewItem = False
        else:
            # merge this continuation row into the previous row, column by column
            # (assumes the pieces line up with the original columns)
            lastRow = results[-1]
            merged = [left + " " + right for left, right in zip(lastRow, temp)]
            results[-1] = merged + lastRow[len(temp):]

with open(csvOutputFileName, 'w', newline='') as outFile:   # csvOutputFileName as in the question
    csv.writer(outFile).writerows(results)

awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in an attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick

things
things
things
Desired output:
It is tasty!
stuff
something
stuff

things
things
things
My attempt:
{
    valid=1;                #boolean variable to keep track if x is valid and should be printed
    if ($x ~ /foo/){        #x is valid unless it contains foo
        valid=0;            #invalidate x so that it doesn't get printed at the end
        next;
    }
    if ($0 ~ /bar/){        #if the current line contains bar
        valid = 0;          #x is invalid (don't print the previous line)
        while (NF == 0){    #don't print until we reach an empty line
            next;
        }
    }
    if (valid == 1){        #x was a valid line
        print x;
    }
    x=$0;                   #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem but I'm interested in learning how this would be done):
Ability to remove n lines before pattern match
Option to include/exclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either
ignored or output, depending on other events, such as the occurrence of foo and bar.
The empty variable keeps track of whether or not the last variable is really
a blank line, or simply empty from inception (e.g., at BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff

things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
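For example, to also drop the blank line between the output paragraphs, simply omit the -v ORS="\n\n" so that ORS stays the default single newline:
$ awk -v RS= '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file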
This awk should work without storing the full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff

things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[idx]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff

things
things
things
If the line contains foo, skip it.
If the flag is enabled and the line is not blank, skip it.
If the flag is enabled and the line is blank, disable the flag (the blank line itself is stored and printed).
If the line contains bar, delete the previous stored line, decrement the counter, enable the flag, and skip the line.
Store every line that makes it through in an array indexed by an incrementing counter.
In the END block, print the stored lines.
Side notes:
To remove n lines before a pattern match, you can create a loop: start with the current line number and, using a reverse for loop, remove lines from your temporary cache (array), then subtract n from your self-defined counter variable (see the sketch below).
To include or exclude blank lines you can use the NF variable. For a typical line, NF is the number of fields based on your field separator; for blank lines it is 0. For example, if you change the line above the END block to NF { line[++idx] = $0 } in the answer above, you will see that all blank lines are bypassed in the output.
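A minimal sketch of that n-line removal, assuming n is passed in with -v n=2 and line[]/idx are as in the answer above:
/bar/ {
    # drop the previous n cached lines (bonus-points variant)
    for (i = 0; i < n && idx > 0; i++)
        delete line[idx--]
    flag = 1
    next
}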

search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

I need to be able to search for a string (let's use 4320101), print the 20 lines above the string, and print everything after it until it finds the string </eventUpdate>.
For example:
Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line
I just want the following result outputted to a file:
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
There are multiple examples of these groups of text in a file that I want.
I tried using this below:
cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt
But it only ever shows 20 lines either side of the string.
Notes:
- the line numbers the text is on are inconsistent, so I cannot go off those; hence why I am using -A 20.
- ideally I'd rather have it stop when it finds </eventUpdate> after the string, and then carry on searching.
Summary: find 4320101, output the 20 lines above 4320101 (or stop at a line of white space), and then output all lines below 4320101 up to </eventUpdate>.
Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.
This might work for you (GNU sed):
sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320101/!D;:c;n;/<\/eventUpdate>/!bc' file
EDIT:
:a;s/\n/&/20;tb;$!{N;ba} keeps a window of 20 lines in the pattern space (PS).
:b;/4320101/!D moves the above window through the file until the pattern 4320101 is found.
:c;n;/<\/eventUpdate>/!bc prints the 20-line window and any subsequent line until the pattern <\/eventUpdate> is found.
Here is an ugly awk solution :)
awk 'BEGIN{last=1}
{if((length($0)==0) || ($0 ~ /Random/))last=NR}
/4320101/{flag=1;
  if((NR-last)>20) last=NR-20;
  cmd="sed -n \""last+1","NR-1"p\" input.txt";
  system(cmd);
}
flag==1{print}
/eventUpdate/{flag=0}' input.txt
So basically what it does is keep track, in the last variable, of the last blank line or line matching the Random pattern. Now, if 4320101 has been found, it prints from that line or from 20 lines back, whichever is nearer, through a system sed command, and sets the flag. The flag causes the lines from then on to be printed until eventUpdate has been found. Have not tested though, but it should be working.
Look-behind in sed/awk is always tricky. This self-contained awk script basically keeps the last 20 lines stored; when it gets to 4320101 it prints these stored lines, back to the point where a blank or undesired line is found, then stops. At that point it switches into print-all mode and prints all lines until eventUpdate is encountered, then it prints that and quits.
awk '
function store( line ) {
    for( i=0; i < 20; i++ ) {
        last[i] = last[i+1];
    };
    last[20]=line;
};
function purge() {
    stop=-1;
    for( i=20; i >= 0; i-- ) {
        if( length(last[i])==0 || last[i] ~ "Random" ) {
            stop=i;
            break
        };
    };
    for( i=(stop+1); i <= 20; i++ ) {
        print last[i];
    };
};
{
    store($0);
    if( /4320101/ ) {
        purge();
        printall=1;
        next;
    };
    if( printall == 1) {
        print;
        if( /eventUpdate/ ) {
            exit 0;
        };
    };
}' test
Let's see if I understand your requirements:
You have two strings, which I'll call KEY and LIMIT. And you want to print:
At most 20 lines before a line containing KEY, but stopping if there is a blank line.
All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)
The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit KEY. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.
So let's do it in awk:
#file: extract.awk
# Initialize the circular buffer
BEGIN { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0 { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) {
    for (i = count < 20 ? 0 : count - 20; i < count; ++i)
        print buf[i % 20];
    count = 0;
}
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT) { print; next; }
# Otherwise, save the line
{ buf[count++ % 20] = $0; }
In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:
awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME
Notes:
I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and there is nowhere in the requirements that regexen are even desired. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
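For instance (the BEGIN block is only there to demonstrate the return value):
$ awk 'BEGIN { print index("haystack", "stack") }'
4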
You can try something like this -
awk '{
    a[NR] = $0
}
END {
    for (i = 1; i <= NR; i++) {
        if (a[i] ~ /4320101/) {
            # print from 20 lines back (or the top) down to the closing tag
            for (j = (i > 20 ? i - 20 : 1); j <= NR; j++) {
                print a[j]
                if (a[j] ~ /<\/eventUpdate>/)
                    break
            }
        }
    }
}' file
The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:
awk '
NR==FNR {
if ($0 ~ /\<4320101\>/) {
for (i=NR-20;i<NR;i++)
range[i]
inRange = 1
}
if (inRange) {
range[NR]
}
if ($0 ~ /<\/eventUpdate>/) {
inRange = 0
}
next
}
FNR in range
' file file

line traversal in awk

I am doing a file traversal in awk. An example of this is:
Dat time range column session - 1
time name place session animal - 2
hi bye name things - 3
I need to traverse line by line, and within each line that contains session I need to traverse word by word.
Thus in this case I need to reach lines 1 and 2, as they contain the word session, and not line 3, which doesn't contain that field (in the sense that I can skip it). From there I need to traverse word by word to reach the session field.
I know $0 can represent the whole line. But my question is how to traverse word by word after reaching the line.
Could you please help me regarding this. Thank you.
You can loop through the current line $0 with this construct:
for(i = 1; i <= NF; i++) print $i
this makes use of the predefined awk variable NF which stands for the number of fields on the current line ($0).
You can examine the value of $i as it iterates through the line and, based on that, determine what to do with the value, e.g. print it, skip it, etc.: if ($i == "session") ...
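For example, assuming the three sample lines above (including their trailing - n markers) are in file, this prints where session occurs:
$ awk '{ for (i = 1; i <= NF; i++) if ($i == "session") print "line " NR ": field " i }' file
line 1: field 5
line 2: field 4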
Update:
You can also use the match() function to determine if the current line you are processing contains the "session" string without iterating through the line. E.g.,
where = match($0, "session")
if (where > 0)
    print "Found session in this line";
else
    print "session not found in this line";
Note that match() takes a regular expression as the 2nd parameter, so your matches can be quite sophisticated. See this page for more information about this function and other awk string functions.
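For instance, match() also sets RSTART and RLENGTH, so a regex-based probe (a sketch) can report where the match begins and how long it is:
awk 'match($0, /sess[a-z]*/) { print "line " NR ": starts at " RSTART ", length " RLENGTH }' file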
You can use a for loop, filtering only on the lines that contain "session":
awk '/session/ {
    for (i = 1; i <= NF; i++) {
        if ($i == "session")
            do_whatever_here
    }
}'
You can read more on these instructions here: for, string comparison and if.
