Awk substring doesn't yield expected result - bash

I've a file whose content is below:
C2:0301,353458082243570,353458082243580,0;
C2:0301,353458082462440,353458082462450,0;
C2:0301,353458082069130,353458082069140,0;
C2:0301,353458082246230,353458082246240,0;
C2:0301,353458082559320,353458082559330,0;
C2:0301,353458080153530,353458080153540,0;
C2:0301,353458082462670,353458082462680,0;
C2:0301,353458081943950,353458081943960,0;
C2:0301,353458081719070,353458081719080,0;
C2:0301,353458081392470,353458081392490,0;
Fields 2 and 3 (with , as the separator) contain 15-digit IMEI range boundaries, not individual IMEI numbers. The usual IMEI format is an 8-digit TAC + a 6-digit serial number + a padding 0. The 6-digit serial number defines the start and end of the range; everything else stays the same. To list the individual IMEIs in each range (which is exactly what I want), I need a loop that increments by one from the serial number of the starting IMEI in field 2 to the serial number of the ending IMEI in field 3. I am using the AWK script below:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
It gives me the below result:
353458082243570,0
353458082243580,0
353458082462440,0
353458082462450,0
353458082069130,0
353458082069140,0
353458082246230,0
353458082246240,0
353458082559320,0
353458082559330,0
353458080153530,0
353458082462670,0
353458082462680,0
353458081943950,0
353458081943960,0
353458081719070,0
353458081719080,0
353458081392470,0
353458081392480,0
353458081392490,0
The above is as expected except for the below line in the result:
353458080153530,0
The result is actually from the below line in the input file:
C2:0301,353458080153530,353458080153540,0;
But the expected output for the above line in input file is:
353458080153530,0
353458080153540,0
I need to know what's going wrong in my script.

The problem with your script is that you start with two string variables, v and t (typed as strings because they are the result of a string operation, substr()). v++ then converts v to a number, which strips any leading zeros, but v <= t is still a string comparison, because comparing a string (t) with a number, a string or a numeric string is always a string comparison. Yes, you can add zero to each of the variables to force a numeric comparison, but IMHO this is more like what you're really trying to do:
$ cat tst.awk
BEGIN { FS=","; re="(.{8})(.{6})(.*)" }
{
match($2,re,beg)
match($3,re,end)
for (i=beg[2]; i<=end[2]; i++) {
printf "%s%06d%s\n", end[1], i, end[3]
}
}
$ gawk -f tst.awk file
353458082243570
353458082243580
353458082462440
353458082462450
353458082069130
353458082069140
353458082246230
353458082246240
353458082559320
353458082559330
353458080153530
353458080153540
353458082462670
353458082462680
353458081943950
353458081943960
353458081719070
353458081719080
353458081392470
353458081392480
353458081392490
When done with appropriate variables like that, no conversion is necessary. Note also that with the above you don't need to repeatedly state the same or related numbers to extract the parts of the strings you care about; you state the number of characters to skip (8) and the number to select (6) just once. The above uses GNU awk for the 3rd arg to match().
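The string-versus-number behaviour described in this answer can be reproduced in isolation. This is only a sketch: the function name compare_demo is made up, and the two values are the serial parts of the failing input line.

```shell
# compare_demo is a hypothetical helper, not part of the original script.
compare_demo() {
  awk 'BEGIN {
    v = "015353"; t = "015354"   # string constants, typed like substr() results
    v++                          # v is now the number 15354
    as_string = (v <= t)         # number vs string: compared as "15354" vs "015354"
    as_number = (v+0 <= t+0)     # both forced numeric: 15354 <= 15354
    print as_string, as_number
  }'
}
compare_demo
```

The first value is 0 because "1" sorts after "0" in a string comparison, which is why the loop stopped one IMEI early; the second is 1 because both sides were forced to numbers.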

The problem was in the while(v <= t) part of the script. I believe that with leading 0s the comparison was not being done numerically, so I made sure both values are converted to numbers for the comparison in the while loop. The AWK documentation says you can force a value to be treated as a number by using value+0, so while(v <= t) needed to change to while(v+0 <= t+0). So the AWK script below:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
was changed to :
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v+0 <= t+0) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
That single change got me the expected output for the failing case. For example, this line in my input file:
C2:0301,353458080153530,353458080153540,0;
now gives me the individual IMEIs:
353458080153530,0
353458080153540,0
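For awks without the GNU match() extension, the same fix can be written with plain substr() and %06d. A minimal sketch, assuming the 8+6+1 digit layout from the question; the function name expand_imei_ranges and the stripping of the trailing ; from the last field are my own assumptions:

```shell
# expand_imei_ranges is a hypothetical name; it reads files given as
# arguments, or standard input when there are none.
expand_imei_ranges() {
  awk -F',' '{
    tac    = substr($2, 1, 8)
    first  = substr($2, 9, 6) + 0    # +0 forces a number, dropping leading zeros
    last   = substr($3, 9, 6) + 0
    suffix = substr($2, 15)          # trailing padding digit
    flag   = $4
    sub(/;$/, "", flag)              # assumption: drop the trailing semicolon
    for (n = first; n <= last; n++)
      printf "%s%06d%s,%s\n", tac, n, suffix, flag   # %06d restores the zeros
  }' "$@"
}
```

It would be called as expand_imei_ranges TEMP.OUT.merge_range_part1_21. Because first and last are numbers from the start, the loop condition never falls back to a string comparison.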

Use an if statement that checks for leading zeros in variable v setting y accordingly:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) { printf "%s%06d%s,%s\n", substr($3,1,8),v,substr($3,15,2),$4; if (substr(v,1,1)=="0") { v++;y="0"v } else { v++;y=v }; v=y } }' TEMP.OUT.merge_range_part1_21
Make sure that the body of the while loop is enclosed in braces, that the line is printed before v is incremented, and that v is incremented WITHIN the if branches.
Set v=y at the end of the loop body so that the next comparison and increment work on the re-padded value. Note that this only restores a single leading zero.

Related

Awk printing out smallest and highest number, in a time format

I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see, the final column is a timing. I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped; I've tried:
awk '{print sort -n < $NF}' timings.txt
However that didn't even seem to sort anything, I just received an output of:
1
0
1
0
...
It kept repeating like that for much longer; I've cut it short since the first few lines give the idea.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227
After the clarifications to the question: if the time field always has the same number of digits in the same places, e.g. h:mm:ss.sss, the solution can be drastically simplified. We no longer need to convert the time to seconds to compare it; a simple string (lexicographical) comparison is enough:
$ awk 'NR==1 {m=M=$3} {$3<m&&m=$3; $3>M&&M=$3} END {printf("min: %s\nmax: %s",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler string-only based comparison for ordering values (determining min/max). We can do that since we know all timings will conform to the same format, and if a < b (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, then by using the lexicographical comparison alone, we can't order them, e.g. "12:00:00" < "3:00:00".)
So, when the first value is read (first record, NR==1), we set the initial min/max to that timing (the 3rd field). For each record we test whether the current value is smaller than the current min and, if it is, we set the new min; similarly for the max. We use short-circuiting instead of if to make the expressions shorter ($3<m && m=$3 is equivalent to if ($3<m) m=$3). In the END block we simply print the result.
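The caveat about inconsistent formatting is easy to check directly. A small sketch, where lexical_less is a made-up helper that prints 1 when its first argument sorts before the second as a string:

```shell
# lexical_less is a hypothetical helper; appending "" forces string comparison.
lexical_less() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    r = ((a "") < (b ""))
    print r
  }'
}
lexical_less "1:22:33" "1:23:00"    # same layout: string order agrees with time order
lexical_less "12:00:00" "3:00:00"   # different layout: 12 hours sorts before 3 hours
```

Both calls print 1. The second one is the problem case: as strings, "12:00:00" is "smaller" than "3:00:00" because "1" sorts before "3".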
Here's a general awk solution that accepts time strings with variable number of digits for hours/minutes/seconds per record:
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:22:35.227
max: 10:22:35.228
Or, in a more readable form:
#!/usr/bin/awk -f
{
split($3, t, ":")
s = t[3] + 60 * (t[2] + 60 * t[1])
if (s < min || NR == 1) {
min = s
min_t = $3
}
if (s > max || NR == 1) {
max = s
max_t = $3
}
}
END {
print "min:", min_t
print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field to seconds which we can later simply compare as numbers. As we iterate, we track the current min val and max val, printing them in the END. Initial values for min and max are taken from the first line (NR==1).
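To see what the conversion produces for a single value, the same formula can be wrapped in a small function. This is just an illustration; to_seconds is a made-up name and the %.3f format is my choice, so that print's default six significant digits do not cut off the fraction:

```shell
# to_seconds is a hypothetical helper: h:mm:ss.sss -> seconds.
to_seconds() {
  awk -v d="$1" 'BEGIN {
    split(d, t, ":")                      # t[1]=hours, t[2]=minutes, t[3]=seconds
    s = t[3] + 60 * (t[2] + 60 * t[1])    # same formula as in the script above
    printf "%.3f\n", s
  }'
}
to_seconds 1:08:22.921
to_seconds 10:22:35.228
```

These print 4102.921 and 37355.228, which compare correctly as numbers even though the hour parts have different widths.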
Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227
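You don't want to run sort inside of awk (even with the proper syntax).
Try this:
sort -k3,3 timings.txt | sed -n '1p; $p'
where
sort orders the lines by the 3rd column as plain strings (sort -n would compare only the leading hour digit); this works because every timing has the same h:mm:ss.sss layout
sed prints the first and last line
If your real file starts with a header line, strip it first: sed 1d timings.txt | sort -k3,3 | sed -n '1p; $p'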

Get common lines, for only specific fields, from multiple files

I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
sprintf("\t%-20s -->\t%s", rec[R], R)
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.
Specifically, for the files mentioned in the example in the link:
if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to pull out:
file1/file2 --> chr1 33251518 AK2 H191D
I found this code at the following link:
http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent in terms of the files themselves. This is unclear from the comments provided, and the printf statements I've added inside the loops fail.
Thank you very much for any insight on this!
The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0 in rec[$0]), and the values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to just work with $1,$2 and $4 like so:
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1) {
split(R,R1R2R4,SUBSEP)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
}
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the partial content of the current record
# (special concatenation of $1, $2 and $4)
# concatenating the filenames separated by / as values
rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]
This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special syntax of awk concatenates the indices with the SUBSEP character, which is by default non-printable ("\034" to be precise), and so it is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END block.
The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to get back the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we're doing sprintf ...%s,...,$1,$2,$4; with pre-saved fields $1, $2, $4. For your input example this will print
records found in 2 files:
foo11.inp1/foo11.inp2 --> chr1 33251518 AK2
Note that the output is missing H191D but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the printed files! You probably don't want to print that, or anyway have to specify how you should treat the columns which are not checked between files (and so may differ).
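The SUBSEP round trip can be tried on its own. A tiny sketch with one hard-coded key (subsep_demo is a made-up name, and the values are taken from your example):

```shell
# subsep_demo is a hypothetical helper showing a composite key and its split.
subsep_demo() {
  awk 'BEGIN {
    rec["chr1", "33251518", "AK2"] = "file1/file2"   # key is joined with SUBSEP
    for (key in rec) {
      n = split(key, part, SUBSEP)                   # recover the three fields
      print n, part[1], part[2], part[3], rec[key]
    }
  }'
}
subsep_demo
```

This prints 3 chr1 33251518 AK2 file1/file2, i.e. the key splits back into exactly the three fields it was built from.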
A bit of explanation for the original code:
rec is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If then file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there are no checks for multiplicity, so if file1 contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2 and obtain 3 for the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually assume each unique line of every input file, allowing us to loop over rec[R], containing the filenames in which that specific line R was present.
n is a return value from split, which splits the value of rec[R] --- that is the filename list corresponding to line R --- at each slash. Eventually the array t is filled with the list of files, but we don't make use of this, we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1, we don't do anything, only if there are multiplicities.
The if (n > 1) branch groups lines into classes according to their multiplicity: n==2 applies to lines that are present in exactly 2 files, n==3 to those that appear in three, and so on. It builds an array dup, which for every multiplicity class (i.e. for every n) collects the output strings "filename1/filename2/... --> R", one for each R that appears n times in total, separated by RS (the record separator, by default a newline).
The loop over D in dup will then go through multiplicity classes (i.e. valid values of n larger than 1), and print the gathered output lines which are in dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there aren't any, then dup is empty, and the loop over D will not do anything).
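If the missing multiplicity check mentioned above matters for your data, one possible guard is to count each (file, line) pair only once. This is a simplified sketch, not the original script: it only reports counts, and the names count_files, seen and nfiles are my own.

```shell
# count_files is a hypothetical helper. It prints "<number of files> <line>"
# for every line found in more than one file, counting each file once.
count_files() {
  awk '
    !seen[FILENAME, $0]++ { nfiles[$0]++ }   # first time this file shows this line
    END {
      for (line in nfiles)
        if (nfiles[line] > 1)
          print nfiles[line], line
    }' "$@"
}
```

Called as count_files file[a-d], a line that occurs twice in file1 and once in file2 is reported with a count of 2 rather than 3.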
First, you'll need to understand the 3 blocks in an AWK script:
BEGIN{
# Code that is executed once, before data processing starts
}
{
# block without a name (default/main block)
# executed per line of input
# $0 contains all line data/columns
# $1 first column
# $2 second column, and so on..
}
END{
# Code that is executed once, after all data processing has finished
}
So you'll probably need to edit this part of the script:
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}

awk: Interpreting strings as mathematical expressions

Context: I have an input file that contains parameters with associated values followed by literal mathematical expressions such as:
PARAMETERS DEFINITION
A = 5; B = 2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
...
and I would like to get the strings of the second part to be interpreted as mathematical expressions so that I get the results of the expressions in my output file:
...
MATHEMATICAL EXPRESSIONS
10
0.2
...
What I did already: So far, using awk, I store all the parameters names and their corresponding values in two distinct arrays. I then replace each parameter with its value so that I am now in a similar situation as the author of this thread.
However, the answers s/he gets are not in awk except for the last one which is very specific to her/his situation, and hard to understand for me as a beginner with awk and shell scripting.
What I tried afterwards: As I have no clue how to do this in awk, the idea I had was to store the new field value in a variable, then use a shell command within the awk script like this:
#!bin/awk -f
BEGIN{}
{
myExpression=$1
system("echo $myExpression | bc")
}
END{}
This, unfortunately does not work as the variable is somehow not recognized by the echo command.
What I would like:
I would prefer a solution using awk alone with no call to external functions, however, I am not against one using a shell command if it is simpler.
EDIT Taking into account all the comments so far, I will be more precise, my input files look more like this:
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
The names of the parameters are complex enough so that any match of the string: "[param#]" within the document corresponds to a parameter that I want changed for its value.
Below is how I store the parameters and their values in arrays:
{
if (match($2,/PARAMETERS_DEFINITION/) != 0) {paramSwitch = 1}
if (match($2,/MATHEMATICAL_EXPRESSIONS/) != 0) {paramSwitch = 0}
if (paramSwitch == 1)
{
parameterName[numOfParam] = $1 ;
parameterVal[numOfParam] = $3 ;
numOfParam += 1
}
}
Instead of this:
{
myExpression=$1
system("echo $myExpression | bc")
}
I think you'd want this:
{
myExpression=$1
system("echo " myExpression " | bc")
}
That's because in awk, assignments do not end up as environment variables, and putting strings next to each other concatenates them.
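To see what string awk actually hands to the shell, it can help to print the command instead of running it. A small sketch (bc_command is a made-up name; the extra double quotes around the expression are my addition, so that the shell does not treat * as a glob):

```shell
# bc_command is a hypothetical helper that only prints the command string.
bc_command() {
  awk -v expr="$1" 'BEGIN {
    cmd = "echo \"" expr "\" | bc -l"   # adjacent strings are concatenated
    print cmd                           # system(cmd) would run it instead
  }'
}
bc_command '5*2'
```

This prints echo "5*2" | bc -l, which is the line the shell would execute if cmd were passed to system().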
You are asking about awk interpreting strings as mathematical expressions. This functionality is usually called eval and, AFAIK, awk has no such function. Your question is therefore a typical XY problem.
The right tool for this is bc: you need to modify (nearly) nothing and can simply feed bc your input. Just make sure the variable names are lowercase, as in the following input (edited from your example):
#PARAMETERS DEFINITION
a=5; b=2; c=1.5; d=7.5
#MATHEMATICAL EXPRESSIONS
a*b
c/d
Running it like this:
bc -l < inputfile
produces
10
.20000000000000000000
EDIT
For the new input data in your edit, the following
grep '\[' inputfile | sed 's/[][]//g' | bc -l
for the input
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
produces the following output:
10
.20000000000000000000
That is: grep keeps only the lines that contain [ (every parameter definition or expression), sed removes the [ and ] characters, and the result is the following bc program:
param1 = 5
param2 = 2
param3 = 1.5
param4 = 7.5
param1*param2
param3/param4
which is then sent as a whole "program" to bc.
Using BIDMAS as a basis, I have created this mathematical function in awk.
I have not included brackets (or indices) yet, as they will require some extra effort, but I may add them later.
This awk script effectively works as bc does for these four operators.
No system call required, all in awk.
Generic version for all applications
awk '{split($0,a,"+")
for(i in a){
split(a[i],s,"-")
for(j in s){
split(s[j],m,"*")
for(k in m){
split(m[k],d,"/")
for(l in d){
if(l>1)d[1]=d[1]/d[l]
}
m[k]=d[1]
delete d
if(k>1)m[1]=m[1]*m[k]
}
s[j]=m[1]
delete m
if(j>1)s[1]=s[1]-s[j]
}
a[i]=s[1]
delete s
}
for(i in a)b=b+a[i];print b}{b=0}' file
For your specific example
awk '
/MATHEMATICAL_EXPRESSIONS/{z=1}
NR>1&&!z{split($0,y," = ");x[y[1]]=y[2]}
z&&/[\+\-\/\*]/{
for (n in x)gsub(n,x[n])
split($0,a,"+")
for(i in a){
split(a[i],s,"-")
for(j in s){
split(s[j],m,"*")
for(k in m){
split(m[k],d,"/")
for(l in d){
if(l>1)d[1]=d[1]/d[l]
}
m[k]=d[1]
delete d
if(k>1)m[1]=m[1]*m[k]
}
s[j]=m[1]
delete m
if(j>1)s[1]=s[1]-s[j]
}
a[i]=s[1]
delete s
}
for(i in a)b=b+a[i];print b}{b=0}' file
There's something like an eval in awk: strings are converted automatically when the context needs a number, and adding +0 forces the conversion.
What I came up with (detailed version below), using a file named awkinput containing your example input:
awk '/[A-Z]=[0-9.]+;/ { for (i=1;i<=NF ;i++) { print "working on "$i; split($i,fields,"="); sub(/;/,"",fields[2]); params[fields[1]]=strtonum(fields[2]) } }; /[A-Z](*|\/|+|-)[A-Z]/ { for (p in params) { sub(p, params[p],$0); }; system("echo " $0 " | bc -ql") }' awkinput
Detailed:
/[A-Z]=[0-9.]+;?/ { # if we match something like A=4.2, with or without a ; at the end
for (i=1;i<=NF ;i++) { # loop through the fields (separated by space, the default Field Separator of awk)
print "working on "$i; # report what we are doing
split($i,fields,"="); # split into an array to get param and value
sub(/;/,"",fields[2]); # remove the ; at the end, if any
params[fields[1]]=strtonum(fields[2]) # new array of parameters where the values are numeric
}
}
/[A-Z](*|\/|+|-)[A-Z]/ { # when the line matches a math operation with (at least) one param on each side
for (p in params) { # loop over the known params
sub(p, params[p],$0); # replace each param with its value
};
system("echo " $0 " | bc -ql") # print the result (no way to avoid the system call here)
}
}
Drawback:
An expression of the form AB*C would be resolved to 52*1.5.
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
$ awk -vRS='[= ;\n]' '{if ($0 ~ /[0-9]/){a[x] = $0; print x"="a[x]}else{x=$0}}/MATHEMATICAL/{print "MATHEMATICAL EXPRESSIONS"}{if ($0~"*") print a[substr($0,1,1)] * a[substr($0,3,1)]}{if ($0~"/") print a[substr($0,1,1)] / a[substr($0,3,1)]}' test
A=5
B=2
C=1.5
D=7.5
MATHEMATICAL EXPRESSIONS
10
0.2
Formatted nicely:
$ cat test.awk
# Store all variables in an array
{
if ($0 ~ /[0-9]/){
a[x] = $0;
print x " = " a[x] # Print the keys & values
}
else{
x = $0
}
}
# Print header
/MATHEMATICAL/ {print "MATHEMATICAL EXPRESSIONS"}
# Do the maths (case can work too, but it's not as widely available)
{
if ($0~"*")
print a[substr($0,1,1)] * a[substr($0,3,1)]
}
{
if ($0~"/")
print a[substr($0,1,1)] / a[substr($0,3,1)]
}
{
if ($0~"+")
print a[substr($0,1,1)] + a[substr($0,3,1)]
}
{
if ($0~"-")
print a[substr($0,1,1)] - a[substr($0,3,1)]
}
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
D+C
C-A
$ awk -f test.awk -vRS='[= ;\n]' test
A = 5
B = 2
C = 1.5
D = 7.5
MATHEMATICAL EXPRESSIONS
10
0.2
9
-3.5

search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

I need to be able to search for a string (let's use 4320101), print the 20 lines above it, and then keep printing after it until another string (</eventUpdate>) is found.
For example:
Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line
I just want the following result outputted to a file:
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
There are multiple examples of these groups of text in a file that I want.
I tried using this below:
cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt
But it only ever shows for 20 lines either side of the string.
Notes:
- the line number the text is on is inconsistent, so I cannot go off these, hence why I am using -A 20.
- ideally I'd rather have it so that when it searches after the string, it stops when it finds </eventUpdate> and then carries on searching.
Summary: find 4320101, output 20 lines above 4320101 (or up to one line of white space), and then output all lines below 4320101 up to </eventUpdate>.
Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.
This might work for you (GNU sed):
sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320101/!D;:c;n;/<\/eventUpdate>/!bc' file
EDIT:
:a;s/\n/&/20;tb;$!{N;ba}; this keeps a window of 20 lines in the pattern space (PS).
:b;/4320101/!D; this moves the window through the file until the pattern 4320101 is found.
:c;n;/<\/eventUpdate>/!bc the 20-line window is printed, followed by every subsequent line until the pattern <\/eventUpdate> is found.
Here is an ugly awk solution :)
awk 'BEGIN{last=1}
{if((length($0)==0) || ($0 ~ /Random/))last=NR}
/4320101/{flag=1;
if((NR-last)>20) last=NR-20;
cmd="sed -n \""last+1","NR-1"p \" input.txt";
system(cmd);
}
flag==1{print}
/eventUpdate/{flag=0}' <filename>
So basically it keeps track of the last blank line, or line containing the Random pattern, in the last variable. When 4320101 is found, it prints from line NR-20 or from last, whichever is nearer, through a system sed command, and sets the flag. The flag causes the following lines to be printed until eventUpdate is found. I have not tested it, but it should work.
Look-behind in sed/awk is always tricky.. This self contained awk script basically keeps the last 20 lines stored, when it gets to 4320101 it prints these stored lines, up to the point where the blank or undesired line is found, then it stops. At that point it switches into printall mode and prints all lines until the eventUpdate is encountered, then it prints that and quits.
awk '
function store( line ) {
for( i=1; i <= 20; i++ ) {
last[i-1] = last[i];
};
last[20]=line;
};
function purge() {
for( i=20; i >= 0; i-- ) {
if( length(last[i])==0 || last[i] ~ "Random" ) {
stop=i;
break
};
};
for( i=(stop+1); i <= 20; i++ ) {
print last[i];
};
};
{
store($0);
if( /4320101/ ) {
purge();
printall=1;
next;
};
if( printall == 1) {
print;
if( /eventUpdate/ ) {
exit 0;
};
};
}' test
Let's see if I understand your requirements:
You have two strings, which I'll call KEY and LIMIT. And you want to print:
1. At most 20 lines before a line containing KEY, but stopping if there is a blank line.
2. All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)
The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit key. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.
So let's do it in awk:
#file: extract.awk
# Initialize the circular buffer
BEGIN { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0 { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) { for (i = count < 20 ? 0 : count - 20; i < count; ++i)
print buf[i % 20];
count = 0;
}
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT) { print; next; }
# Otherwise, save the line
{ buf[count++ % 20] = $0; }
In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:
awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME
Notes:
I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and there is nowhere in the requirements that regexen are even desired. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
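The difference between index() and a regex match only shows up when the key contains regex metacharacters. A quick sketch to illustrate (match_demo is a made-up helper; it prints 1 or 0 for "matches as a regex" and "found as literal text"):

```shell
# match_demo is a hypothetical helper: match_demo KEY LINE
match_demo() {
  awk -v KEY="$1" -v line="$2" 'BEGIN {
    as_regex = (line ~ KEY) ? 1 : 0        # KEY interpreted as a regular expression
    as_text  = index(line, KEY) ? 1 : 0    # KEY treated as literal text
    print as_regex, as_text
  }'
}
match_demo '1.5' '125'
match_demo '4320101' 'id="4320101"'
```

The first call prints 1 0, because the dot in 1.5 matches any character as a regex but not as literal text. The second prints 1 1, so for a plain numeric key like 4320101 both approaches agree.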
You can try something like this -
awk '{
a[NR] = $0
}
/<\/eventUpdate>/ {
x = NR
}
END {
for (i in a) {
if (a[i]~/4320101/) {
for (j=i-20;j<=x;j++) {
print a[j]
}
}
}
}' file
The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:
awk '
NR==FNR {
if ($0 ~ /\<4320101\>/) {
for (i=NR-20;i<NR;i++)
range[i]
inRange = 1
}
if (inRange) {
range[NR]
}
if ($0 ~ /<\/eventUpdate>/) {
inRange = 0
}
next
}
FNR in range
' file file

line traversal in awk

I am doing a file traversal in awk. An example of this is
Dat time range column session - 1
time name place session animal - 2
hi bye name things - 3
I need to traverse the file line by line, and then word by word within each line that contains session.
Thus in this case I need to reach lines 1 and 2, as they contain the word session, and not line 3, as it doesn't contain that field (in the sense that I can skip it). From there I need to traverse word by word to reach the session field.
I know $0 can represent the whole line. But my question is how to traverse word by word after reaching the line.
Could you please help me regarding this. Thank you.
You can loop through the current line $0 with this construct:
for(i = 1; i <= NF; i++) print $i
This makes use of the predefined awk variable NF, which stands for the number of fields on the current line ($0).
You can examine the value of $i as it iterates through the line and based on that determine what to do with the value. E.g, print it, skip it, etc. if ($i == "session") ...
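Putting the loop and the comparison together, a minimal sketch could report where the session word sits on each line that has it. The function name find_session_field and the output format are only illustrative:

```shell
# find_session_field is a hypothetical helper; it reads files given as
# arguments, or standard input when there are none.
find_session_field() {
  awk '/session/ {
    for (i = 1; i <= NF; i++)          # walk the line word by word
      if ($i == "session")
        print NR ": field " i          # line number and field position
  }' "$@"
}
```

For the three sample lines in the question this prints 1: field 5 and 2: field 4, and nothing for line 3.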
Update:
You can also use the match() function to determine if the current line you are processing contains the "session" string without iterating through the line. E.g.,
where = match($0, "session")
if (where > 0)
print "Found session in this line";
else
print "session not found in this line";
Note that match() takes a regular expression as the 2nd parameter, so your matches can be quite sophisticated. See this page for more information about this function and other awk string functions.
You can use a for loop, filtering only on the lines that contain "session":
awk '/session/{ for (i = 1; i <= NF; i++) { \
if ($i == "session") \
do_whatever_here \
} \
}'
You can read more on these instructions here: for, string comparison and if.
