How to extract lines between two patterns only with awk? - shell

$ awk '/abc/{flag=1; next} /edf/{flag=0} flag' file
The bare flag condition prints $0 for every line while flag is set, but I only need the lines from the first block between the two strings.
input:
abc
111
222
edf
333
444
abc
555
666
edf
output:
111
222

So I'm assuming you want to print the matching lines only for the first occurrence.
For that you can use an additional variable and set it once flag goes back to 0:
$ cat file
abc
111
222
edf
333
444
abc
555
666
edf
$ awk '/abc/{flag=1; next} /edf/{if(flag) got1stoccurrence=1; flag=0} flag && !got1stoccurrence' file
111
222

If you only want the first set of output, then:
awk '/abc/{flag=1; next} /edf/{if (flag == 1) exit} flag' file
Or:
awk '/abc/{flag++; next} /edf/{if (flag == 1) flag++} flag == 1' file
There are other ways to do it too, no doubt. The first is simple and to the point. The second is more flexible if you also want to process the first group of lines appearing between another pair of patterns.
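To illustrate that flexibility, here is an untested sketch, with ghi and jkl as a hypothetical second pair of markers and the blocks assumed not to overlap; each pair gets its own counter so each block prints only on its first occurrence:
awk '
  /abc/ { f++; next }          # first pair: start marker
  /edf/ { if (f == 1) f++ }    # first pair: end marker locks the counter past 1
  /ghi/ { g++; next }          # second pair: start marker (hypothetical)
  /jkl/ { if (g == 1) g++ }    # second pair: end marker (hypothetical)
  f == 1 || g == 1             # print only inside the first block of either pair
' file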
Note that if the input file contains:
xyz
edf
pqr
abc
111
222
edf
It is important not to do anything about the first edf; it is an uninteresting line because no abc line has been read yet.

Using getline with while:
$ awk '/abc/ { while(getline==1 && $0!="edf") print; exit }' file
111
222
Look for /abc/; once found, records are printed in the while loop until edf is found.
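Note that the loop compares $0 against the exact string edf. If the end marker can appear embedded in a longer line, the same idea works with a regex test (checking getline's return value also guards against read errors):
$ awk '/abc/ { while ((getline line) > 0 && line !~ /edf/) print line; exit }' file
111
222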

$ awk '/edf/{exit} f; /abc/{f=1}' file
111
222
If it were possible for edf to appear before abc in your input, then it'd be:
$ awk 'f{if (/edf/) exit; print} /abc/{f=1}' file
111
222
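For comparison, awk's range operator /abc/,/edf/ also selects lines between two patterns, but it prints the delimiter lines themselves and matches every block, not just the first, which is why the flag idiom is used above:
$ awk '/abc/,/edf/' file
abc
111
222
edf
abc
555
666
edf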

Related

Finding changes in one field between files where another field is the same

I am comparing two hashsets of my data from a year ago and through a series of bashing I have cut the two files down to just a hash value and the filename. We are talking close to 2 million entries.
From this great answer here I have been able to confirm where the hashes exist in both files, and where they exist in one but not the other (e.g., the second set has 40K files added to it, whereas only 4 files from the first set are missing, i.e. they just don't appear in the second set).
I could verify that 40K files were added from old to new via:
awk 'FNR==NR{a[$1]=1;next}!($1 in a)' oldfile newfile | wc -l
and swapping the files around, I could see that only 4 files were missing.
I then realised I was basing this on hash alone. I'd actually like to base this on the filename.
Swapping the field number, I was able to confirm a slightly different set of numbers. The additions to the newfile were no problem, but I noticed there were only 3 files missing from the first set.
Now what I want to do is take this to the next level and confirm the number of files that exist in both locations (easy enough):
awk 'FNR==NR{a[$2]=1;next}($2 in a)' oldfile newfile | wc -l
but where the first field will be different.
:~/working-hashset$ head file?
==> file1 <==
111 abc
222 def
333 ghi
444 jkl
555 fff
666 sss
777 vvv
==> file2 <==
111 abc
212 def
333 ggi
454 jjl
555 fff
656 sss
777 vss
:~/working-hashset$ awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(($1 in a)) print $0;}' file1 file2
111 abc
555 fff
:~/working-hashset$ awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(!($1 in a)) print $0;}' file1 file2
212 def
656 sss
:~/working-hashset$
This has been a work in progress (just writing this question which I started hours ago, I have solved some problems already... moving along).
I am at the stage where I have tested both files and have been able to detect hash collisions, good hashes, deleted files and new files.
:~/working-hashset$ head file?
==> file1 <==
111 dir1/aaa Original good
222 dir1/bbb Original changed
333 dir1/ccc Original good will move
444 dir1/ddd Original change and moved
555 dir2/eee Deleted
666 dir2/fff Hash Collision
999 dir2/zzz Deleted
==> file2 <==
111 dir1/aaa Good
2X2 dir1/bbb Changed
333 dir3/ccc Moved but good
4X4 dir3/ddd Moved and changed
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
777 dir5/ggg New file
888 dir5/hhh New file
:~/working-hashset$ cat hashutil
#!/usr/bin/env bash
echo Unique to file 1
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b)' file2 file1 # in 1, !in 2
echo
echo Unique to file 2
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b)' file1 file2 # in 2, !in 1
echo
echo In both files and good
awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(($1 in a)) print $0;}' file2 file1 # in both files and good
echo
echo In both files, wrong hash
awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(!($1 in a)) print $0;}' file2 file1 # in both files and wrong hash
echo
echo hash collision
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b) {if(($1 in a)) print $0;}' file1 file2 # hash collision
echo
echo Done!
And this is the output:
Unique to file 1
333 dir1/ccc Original good will move
444 dir1/ddd Original change and moved
555 dir2/eee Deleted
666 dir2/fff Hash Collision
999 dir2/zzz Deleted
Unique to file 2
333 dir3/ccc Moved but good
4X4 dir3/ddd Moved and changed
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
777 dir5/ggg New file
888 dir5/hhh New file
In both files and good
111 dir1/aaa Original good
In both files, wrong hash
222 dir1/bbb Original changed
hash collision
333 dir3/ccc Moved but good
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
Done!
I now want to detect MOVED files.
I know that I'm going to need to break this into further "chunks", delimited by the forward slash and at different levels.
I know about the number of fields (NF), and I want to compare the first field (the space-delimited hash) together with the last path component (the slash-delimited basename); upon a match, compare the rest. If it's all the same then it's the same file; if only the directory differs, it has moved.
I just don't even know where to start now (being 4 am isn't helping).
Any help is appreciated.
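Not a full answer, but a sketch of one possible starting point, assuming (as in the samples above) that each line is "hash path description..." and that a moved file keeps its hash and basename while only its directory changes:
awk '
  # basename(p): the last slash-delimited component of p
  function basename(p,    n, parts) { n = split(p, parts, "/"); return parts[n] }
  # dirname(p): everything before the last slash (p itself if there is no slash)
  function dirname(p) { sub(/\/[^\/]*$/, "", p); return p }
  FNR == NR { olddir[$1 SUBSEP basename($2)] = dirname($2); next }
  ($1 SUBSEP basename($2)) in olddir &&
  olddir[$1 SUBSEP basename($2)] != dirname($2) {
      print $0, "(moved from " olddir[$1 SUBSEP basename($2)] ")"
  }
' file1 file2
On the sample files this would report 333 dir3/ccc as moved from dir1, while 4X4 dir3/ddd is not flagged because its hash changed too. Note that 111 dir4/aaa would also show up, since a duplicate is indistinguishable from a move under this rule; separating copies from moves needs an extra pass.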

gawk use to replace a line containing a pattern with multiple lines using variable

I am trying to replace a line containing the Pattern using gawk, with a set of lines. Let's say, file aa contains
aaaa
ccxyzcc
aaaa
ddxyzdd
I'm using gawk to replace all lines containing xyz with a set of lines 111\n222, my changed contents would contain:
aaaa
111
222
aaaa
111
222
But, if I use:
gawk -v nm2="111\n222" -v nm1="xyz" '{ if (/nm1/) print nm2;else print $0}' "aa"
The changed content shows:
aaaa
ccxyzcc
aaaa
ddxyzdd
I need the entire lines that contain xyz, i.e. ccxyzcc and ddxyzdd, to be replaced with 111 followed by 222. Please help.
The problem with your code is that /nm1/ matches the literal string nm1 as a pattern, not the value of the nm1 variable:
$ gawk -v nm2="111\n222" -v nm1="xyz" '$0 ~ nm1{print nm2; next} 1' aa
aaaa
111
222
aaaa
111
222
Thanks @fedorqui for the suggestion: next can be avoided by simply overwriting the content of the matching input line with the required text:
gawk -v nm2="111\n222" -v nm1="xyz" '$0 ~ nm1{$0=nm2} 1' aa
Solution with GNU sed
$ nm1='xyz'
$ nm2='111\n222'
$ sed "/$nm1/c $nm2" aa
aaaa
111
222
aaaa
111
222
The c command deletes the line matching the pattern and appends the given text.
When using awk's ~ operator, you don't need to provide a literal regex on the right-hand side.
Your command, with the improper syntax corrected, would be something like:
gawk -v nm2="111\n222" -v nm1="xyz" '{ if ( $0 ~ nm1 ) print nm2;else print $0}' input-file
which produces the output:
aaaa
111
222
aaaa
111
222
This is how I'd do it:
$ cat aa
aaaa
ccxyzcc
aaaa
ddxyzdd
$ awk '{gsub(/.*xyz.*/, "111\n222")}1' aa
aaaa
111
222
aaaa
111
222
$
Passing variables as patterns to awk is always a bit tricky.
awk -v nm2='111\n222' '{if ($1 ~ /xyz/){ print nm2 } else {print}}'
will give you the output, but the 'xyz' pattern is now hard-coded.
Passing nm1 as a shell variable will also work:
nm1=xyz
awk -v nm2='111\n222' '{if ($1 ~ /'$nm1'/){ print nm2 } else {print}}' aa
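A sketch of a more robust variant: pass the pattern itself with -v, which avoids splicing a shell variable into the awk source and the quoting problems that come with it (-v also expands backslash escapes, which is why the \n in nm2 becomes a real newline):
nm1=xyz
awk -v nm2='111\n222' -v pat="$nm1" '{ if ($0 ~ pat) print nm2; else print }' aa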

How to print columns one after the other in bash?

Are there any better methods to print two or more columns as one column? For example:
input.file
AAA 111
BBB 222
CCC 333
output:
AAA
BBB
CCC
111
222
333
I can only think of:
cut -f1 input.file >output.file;cut -f2 input.file >>output.file
But it's not good if there are many columns, or when I want to pipe the output to other commands like sort.
Any other suggestions? Thank you very much!
With awk:
awk '{if(maxc<NF)maxc=NF;
      for(i=1;i<=NF;i++) a[i] = (a[i]=="" ? $i : a[i] RS $i)
     }
     END{
      for(i=1;i<=maxc;i++)print a[i]
     }' input.file
You can use a GNU awk array of arrays to store all the data and print it later on.
If the number of columns is constant, this works for any number of columns:
gawk '{for (i=1; i<=NF; i++)     # loop over columns
           data[i][NR]=$i        # store in data[column][line]
      }
      END {for (i=1;i<=NF;i++)       # loop over columns
               for (j=1;j<=NR;j++)   # loop over lines
                   print data[i][j]  # print the given field
      }' file
Note NR stands for number of records (that is, number of lines here) and NF stands for number of fields (that is, the number of fields in a given line).
If the number of columns changes over rows, then we need yet another array, in this case to store the number of columns for each row; that version is shown further below.
See a sample with three columns:
$ cat a
AAA 111 123
BBB 222 234
CCC 333 345
$ gawk '{for (i=1; i<=NF; i++) data[i][NR]=$i} END {for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print data[i][j]}' a
AAA
BBB
CCC
111
222
333
123
234
345
If the number of columns is not constant, using an array to store the number of columns for each row (plus the maximum column count seen) helps to keep track of it:
$ cat sc.wk
{for (i=1; i<=NF; i++)
     data[i][NR]=$i
 columns[NR]=NF
 if (NF>maxnf) maxnf=NF
}
END {for (i=1;i<=maxnf;i++)
         for (j=1;j<=NR;j++)
             print (i<=columns[j] ? data[i][j] : "-")
}
$ cat a
AAA 111 123
BBB 222
CCC 333 345
$ gawk -f sc.wk a
AAA
BBB
CCC
111
222
333
123
-
345
awk '{print $1;list[i++]=$2}END{for(j=0;j<i;j++){print list[j];}}' input.file
This assumes exactly two columns. Output:
AAA
BBB
CCC
111
222
333
A simpler solution, if the order of the output doesn't matter (this prints every field on its own line in row order, not column by column), would be:
awk -v RS="[[:blank:]\t\n]+" '1' input.file
Expects tab as delimiter:
$ cat <(cut -f 1 asd) <(cut -f 2 asd)
AAA
BBB
CCC
111
222
333
Since the order is of no importance:
$ awk 'BEGIN {RS="[ \t\n]+"} 1' file
AAA
111
BBB
222
CCC
333
Ugly, but it works:
for i in {1..2} ; do awk -v p="$i" '{print $p}' input.file ; done
Change the {1..2} to {1..n} where n is the number of columns in the input file.
Explanation:
We define an awk variable p holding the shell loop variable i; i varies from 1 to n, and at each step we print the i'th column of the file.
This will work for an arbitrary number of space-separated columns:
awk '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
If space is not the separator, let's suppose ":" is the separator:
awk -F: '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
(Note that sort -u sorts the values and removes duplicates, so the original order is not preserved.)

Replace line after match

Given this file
$ cat foo.txt
AAA
111
BBB
222
CCC
333
I would like to replace the first line after BBB with 999. I came up with this command
awk '/BBB/ {f=1; print; next} f {$1=999; f=0} 1' foo.txt
but I am curious about shorter commands with either awk or sed.
This might work for you (GNU sed):
sed '/BBB/!b;n;c999' file
If a line contains BBB, print that line and then change the following line to 999.
!b negates the previous address (regexp) and breaks out of any processing, ending the sed commands, n prints the current line and then reads the next into the pattern space, c changes the current line to the string following the command.
This is a bit shorter:
awk 'f{$0="999";f=0}/BBB/{f=1}1' file
f {$0="999";f=0} if f is true, set the line to 999 and f back to 0
/BBB/ {f=1} if the pattern matches, set f to 1
1 prints all lines, since 1 is always true.
You can use sed too; it's shorter:
sed '/BBB/{n;s/.*/999/}'
(BSD/macOS sed may need a semicolon before the closing brace: sed '/BBB/{n;s/.*/999/;}'.)
$ awk '{print (f?999:$0); f=0} /BBB/{f=1}' file
AAA
111
BBB
999
CCC
333
awk '/BBB/{print;getline;$0="999"}1' your_file
(The getline is unchecked: if BBB is the last line, getline fails, and a stray 999 line is still printed.)
sed 's/\(BBB\)/\1\
999/'
The literal newline form works on macOS (BSD sed), which doesn't interpret \n in the replacement text. Note, though, that this inserts a 999 line after BBB rather than replacing the line that follows.

Transpose Columns in a single comma separated row conditionally

I have an input file that looks like this:
aaa 111
aaa 222
aaa 333
bbb 444
bbb 555
I want to create a transposed output file that looks like this:
aaa 111,222,333
bbb 444,555
How can I do this using awk, sed, etc?
One way using awk:
$ awk '{a[$1]=a[$1]?a[$1]","$2:$2}END{for(k in a)print k,a[k]}' file
aaa 111,222,333
bbb 444,555
And if your implementation of awk doesn't support the ternary operator then:
$ awk 'a[$1]{a[$1]=a[$1]","$2;next}{a[$1]=$2}END{for(k in a)print k,a[k]}' file
aaa 111,222,333
bbb 444,555
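Note that for (k in a) visits the keys in an unspecified order; it just happened to match the input order here. A sketch that preserves the order of first appearance:
$ awk '!($1 in a){order[++n]=$1} {a[$1]=a[$1]?a[$1]","$2:$2} END{for(i=1;i<=n;i++)print order[i], a[order[i]]}' file
aaa 111,222,333
bbb 444,555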
Your new file does not cause any problems for the script; what output are you getting? I suspect it's a line-ending issue. Run dos2unix on the file to fix the line endings.
$ cat file
APM00065101435 189
APM00065101435 190
APM00065101435 191
APM00065101435 390
190104555 00C7
190104555 00D1
190104555 00E1
190104555 0454
190104555 0462
$ awk '{a[$1]=a[$1]?a[$1]","$2:$2}END{for(k in a)print k,a[k]}' file
APM00065101435 189,190,191,390
190104555 00C7,00D1,00E1,0454,0462
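If running dos2unix isn't convenient, a sketch of the same one-liner that strips carriage returns inside awk (assuming CRLF line endings are the only problem):
$ awk '{ sub(/\r$/,"") } { a[$1]=a[$1]?a[$1]","$2:$2 } END { for (k in a) print k, a[k] }' file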
Code for GNU sed:
I asked a question about this myself and got a very good and useful answer from potong:
sed -r ':a;$!N;s/^(([^ ]+ ).*)\n\2/\1,/;ta;P;D' file
sed -r ':a;$!N;s/^((\S+\s).*)\n\2/\1,/;ta;P;D' file
Both variants join adjacent lines that share the same first field, so the input must already be grouped by that key.
