grep for duration=N while N is longer than X, the position of this sentence changes between lines - bash

I have a very long file with a format of column_name=column_val, column_name2=column_val2 and so on.
The columns are not in a fixed order. For example, say I have this file:
bar=x moshe=foo test=x duration=5
moshe=foo2 test=y duration=0 bar=y
duration=3 moshe=foo3 bar=z test=x
I want to return only the lines where duration is greater than 2.
As far as I know, awk is not an option since I can't tell where the column is located in each line.
On IRC, in the #bash channel, someone recommended using gawk's match(). There too I had trouble seeing how to make it work when duration is in a different place on each line.
any ideas?
thanks

You can use duration= as field separator:
# showing field content in numeric context
$ awk -F'duration=' '{print +$2}' ip.txt
5
0
3
# use required numeric comparison to get desired output
$ awk -F'duration=' '+$2 > 2' ip.txt
bar=x moshe=foo test=x duration=5
duration=3 moshe=foo3 bar=z test=x
See https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html for conversion details
Unary + works on GNU awk, not sure about other versions. 0+$2 should work everywhere to force numeric context.
Note that if you have multiple duration= in a line, only the first one will be tested.
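A minimal sketch of the same field-separator idea with the threshold passed in as a variable. The function name filter_duration is hypothetical, and 0+ style coercion ($2 + 0) is used so it should not depend on GNU awk:

```shell
# Hypothetical helper (not from the answer above): print the lines read
# from stdin whose duration= value is greater than the threshold in $1.
filter_duration() {
    # NF > 1 skips lines that contain no "duration=" at all;
    # $2 + 0 forces a numeric comparison on the text after "duration=".
    awk -v min="$1" -F'duration=' 'NF > 1 && $2 + 0 > min'
}

printf '%s\n' \
    'bar=x moshe=foo test=x duration=5' \
    'moshe=foo2 test=y duration=0 bar=y' \
    'duration=3 moshe=foo3 bar=z test=x' |
    filter_duration 2
```

As with the one-liner above, only the first duration= on a line is tested.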

Extract the data with regex and compare.
awk '0+gensub(".*duration=([0-9]*).*", "\\1", "1") > 2'
#edit as above, the 0+ is needed to convert string to integer.

If you want to use grep:
grep -E 'duration=([3-9]|[1-9][0-9]+)([^0-9]|$)' "file"

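awk '/duration/ {
    for (counter = 1; counter <= NF; counter++) {
        # only look at the field that starts with "duration="
        if ($counter ~ /^duration=/) {
            value = substr($counter, index($counter, "=") + 1)
            # substr() returns a string, so add 0 to force a numeric comparison
            if (value + 0 > 2) {
                print $0
            }
        }
    }
}' inputfile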
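The field-by-field loop can also be sketched with the key name and threshold as parameters. The function name filter_key_gt is hypothetical; the key is compared exactly, so a column such as xduration= is not mistaken for duration=:

```shell
# Hypothetical helper: print lines from stdin where the column named $1
# (in key=value form, any position) has a numeric value greater than $2.
filter_key_gt() {
    awk -v key="$1" -v min="$2" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")                       # position of "=" in this field
            if (eq && substr($i, 1, eq - 1) == key &&
                substr($i, eq + 1) + 0 > min) {       # + 0 forces a numeric comparison
                print
                next                                  # print each line at most once
            }
        }
    }'
}

printf '%s\n' \
    'bar=x moshe=foo test=x duration=5' \
    'moshe=foo2 test=y duration=0 bar=y' \
    'duration=3 moshe=foo3 bar=z test=x' |
    filter_key_gt duration 2
```

The + 0 matters: compared as strings, "10" sorts before "2".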

Related

How to print both the grep pattern and the resulting matched line on the same line?

I have two files File01 and File02.
File01, looks like this:
BU24DRAFT_430534
BU24DRAFT_488391
BU24DRAFT_488386
BU24DRAFT_417707
BU24DRAFT_417704
BU24DRAFT_488335
BU24DRAFT_429509
BU24DRAFT_210092
BU24DRAFT_229465
BU24DRAFT_498094
BU24DRAFT_416051
BU24DRAFT_482795
BU24DRAFT_4305
BU24DRAFT_10621
BU24DRAFT_4883
File02, looks like this:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79
Using the string from File01, via grep, I would like to identify the lines in File02 that match and with this information generate a file that would look like this:
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
I tried generating such file using the following code:
while read r;do CMD01=$(echo $r);CMD02=$(grep $r File01); echo "$CMD02 $CMD01";done < File02 | awk '(NR>1) && ($2 > 2 ) '
The problem I run into is that I obtain extra matching lines:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4305
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4883
Where, for example, the string: BU24DRAFT_4305 is wrongly recognizing the string: XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
This result is incorrect. The string in File01 must match a string in File02 that has a complete version of File01's string
Any ideas that could help me will be appreciated.
For the updated sample input and full-matching requirement and assuming you never have any regexp metacharacters in file1 and that the matching strings in file2 are never at the start or end of the line:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ($0 ~ ("[^[:alnum:]]"str"[^[:alnum:]]")) print $0, str}' file1 file2
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_430534
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
Original answer doing partial matching:
The correct approach is 1 call to awk:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if (index($0,str)) print $0, str}' file1 file2
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and https://mywiki.wooledge.org/Quotes for some of the issues with the script in your question.
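For the full-matching requirement, one alternative sketch is to pull the ID out of each file2 line and look it up in a hash, instead of testing every string against every line. This assumes every ID has the form BU24DRAFT_ followed by digits, as in the sample data; the function name match_ids and the temporary file names are hypothetical:

```shell
# Hypothetical helper: exact-ID join using a hash lookup.
# Assumes IDs always look like BU24DRAFT_<digits>.
match_ids() {
    # $1 = file with one ID per line, $2 = file with the long names
    awk '
        NR == FNR { ids[$0]; next }            # first file: remember each ID
        match($0, /BU24DRAFT_[0-9]+/) {        # second file: locate the ID
            id = substr($0, RSTART, RLENGTH)   # the full run of digits, so no partial hits
            if (id in ids) print $0, id
        }
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' BU24DRAFT_430534 BU24DRAFT_4305 > "$tmp/File01"
printf '%s\n' \
    XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 \
    > "$tmp/File02"
match_ids "$tmp/File01" "$tmp/File02"
rm -r "$tmp"
```

Because the whole digit run is extracted before the lookup, BU24DRAFT_4305 cannot match a line that contains BU24DRAFT_430534.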
So, it looks like yours mostly works. A lot of what you are doing here is unnecessary. Here is your script broken into multiple lines for readability:
while read r; do
CMD01=$(echo $r)
CMD02=$(grep $r zztest01)
echo "$CMD02 $CMD01"
done < <(head zztest) | awk '(NR>1) && ($2 > 2 ) '
First, CMD01=$(echo $r): This is really the same (or intended to be) as CMD01="$r" so kind of useless.
Then, < <(head zztest): You are using head to output the contents of the file. This actually works just as well with a simple redirection like this: < zztest.
Last, | awk '(NR>1) && ($2 > 2 ) ': This appears to just be some sort of logic on whether we are going to print anything or not.
Here is a simplified version:
while read r; do
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"
done < zztest
Explanation
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r": The main part of this is really two commands separated by &&. This means: execute the second command only if the first one succeeded. grep returns a "failure" exit code if it does not find what it is looking for. So, if grep does not find a match, echo will not run.
The output of grep will be stored in the variable $CMD02. Then, you will echo that along with $r for each match.
If you really want to keep this on one line like the original:
while read r; do CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"; done < zztest
Update
If you want to avoid partial matches as Ed asked, you can change the grep to this grep "$r[^0-9]" zztest01. This will avoid a match if there is a trailing digit after the initial match string (which is really an assumption given the sample).
While not explicit in the question, it seems that each pattern should only match a single line in the input file (File02).
Based on this observation, it is possible to improve the performance of Ed Morton's solution:
awk '
NR==FNR{strs[$0]; next}
{ for (str in strs) if (index($0,str)) { print $0, str ; delete strs[str]; next } }
' file1 file2
For large files with many patterns, it will reduce runtime by a factor of 4.

How to select text in a file until a certain string using grep, sed or awk?

I have a huge file (this is just a sample) and I would like to select all lines with "Ph_gUFAC1083" and all after until reach one that doesn't have the code (in this example Ph_gUFAC1139)
>uce_353_Ph_gUFAC1083 |uce_353
TTTAGCCATAGAAATGCAGAAATAATTAGAAGTGCCATTGTGTACAGTGCCTTCTGGACT
GGGCTGAAGGTGAAGGAGAAAGTATCATACTATCCTTGTCAGCTGCAAGGGTAATTACTG
CTGGCTGAAATTACTCAACATTTGTTTATAAGCTCCCCAGAGCATGCTGTAAATAGATTG
TCTGTTATAGTCCAATCACATTAAAACGCTGCTCCTTGCAAACTGCTACCTCCTGTTTTC
TGTAAGCTAGACAGAGAAAGCCTGCTGCTCACTTACTGAGCACCAAGCACTGAAGAGCTA
TGTTTAATGTGATTGTTTTCATTAGCTCTTCTCTGTCTGATATTACATTTATAATTTGCT
GGGCTTGAAGACTGGCATGTTGCATTGCTTTCATTTACTGTAGTAAGAGTGAATAGCTCT
AT
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
CATGGAAAACGAGGAAAAGCCATATCTTCCAGGCCATTAATATTACTACGGAGACGTCTT
CATATCGCCGTAATTACAGCAGATCTCAAAGTGGCACAACCAAGACCAGCACCAAAGCTA
AAATAACTCGCAGGAGCAGGCGAGCTGCTTTTGCAGCCCTCAGTCCCAGAAATGCTCGGT
AGCTTTTCTTAAAATAGACAGCCTGTAAATAAGGTCTGTGAACTCAATTGAAGGTGGCTG
TTTCTGAATTAGTCAGCCCTCACAAGGCTCTCGGCCTACATGCTAGTACATAAATTGTCC
ACTTTACCACCAGACAAGAAAGATTAGAGTAATAAACACGGGGCATTAGCTCAGCTAGAG
AAACACACCAGCCGTTACGCACACGCGGGATTGCCAAGAACTGTTAACCCCACTCTCCAG
AAACGCACACAAAAAAACAAGTTAAAGCCATGACATCATGGGAA
>uce_4300_Ph_gUFAC1139 |uce_4300
ATTAAAAATACAATCCTCATGTTTGCATTTTGCAGTCGTCAACAAGAAATTGAAGAGAAA
CTCATAGAGGAAGAAACTGCTCGAAGGGTGGAAGAACTTGTAGCTAAACGCGTGGAAGAA
GAGCTGGAGAAAAGAAAGGATGAGATTGAGCGAGAGGTTCTCCGCAGGGTGGAGGAGGCT
AAGCGCATCATGGAAAAACAGTTGCTCGAAGAACTCGAGCGACAGCGACAAGCTGAACTT
GCAGCACAAAAAGCCAGAGAGGTAACGCTCGGTCGTTTGGAAAGTAGAGACAGTCCATGG
CAAAACTTTCAGTGTCGGTTTGTGCCTCCTGTTCGGTTCAGAAAGAGATGGAATACAGCA
AATCTAATTCCCTTCTCATATAAACTTGCATTGCTGCGAAACTTAATTTCTAGCCTATTC
AGAGGAGCTCACTGATATTTAAACAGTTACTCTCCTAAAACCTGAACAAGGATACTTGAT
TCTTAATGGAACTGACCTACATATTTCAGAATTGTTTGAAACTTTTGCCATGGCTGCAGG
ATTATTCAGCAGTCCTTTCATTTT
>uce_1039_Ph_gUFAC1139 |uce_1039
ATTAGTGGAATACAAATATGCAAAAACCAAACAGTTTGGTGCTATAATGTGAAAAGAAAT
TTACACCAATCTTATTTTTAATTTGTATGGGAACATTTTTACCACAAATTCCATATTTTA
ATAATACTATCCCAACTCTATTTTTTAGACTCATTTTGTCACTGTTTTGTAACAGAAACA
CTGTAAATATTATAGATGTGGTAAACTATTATACTTGTTTTCTTATAAATGAAATGATCT
GTGCCAACACTGACAAAATGAATTAATGTGTTACTAAGGCAACAGTCACATTATATGCTT
TCTCTTTCACAGTATGCGGTAGAGCATATGGTTTACTCTTAATGGAACACTAGCTTCTCA
TTAACATACCAGTAGCAATGTCAGAACTTACAAACCAGCATAACAGAGAAATGGAAAAAC
TTATAAATTAGACCCTTTCAGTATTATTGAGTAGAAAATGACTGATGTTCCAAGGTACAA
TATTTAGCTAATACAGTGCCCTTTTCTGCATCTTTCTTCTCAAAGGAAAAAAAAATCCTC
AAAAAAAACCAGAGCAAGAAACCTAACTTTTTCTTGT
I already tried several alternatives without success, the closest I reached was
sed -n '/Ph_gUFAC1083/, />/p' file.txt
that gave me that:
>uce_2347_Ph_gUFAC1083 |uce_2347
GCTTTTCTATGCAGATTTTTTCTAATTCTCTCCCTCCCCTTGCTTCTGTCAGTGTGAAGC
CCACACTAAGCATTAACAGTATTAAAAAGAGTGTTATCTATTAGTTCAATTAGACATCAG
ACATTTACTTTCCAATGTATTTGAAGACTGATTTGATTTGGGTCCAATCATTTAAAAATA
AGAGAGCAGAACTGTGTACAGAGCTGTGTACAGATATCTGTAGCTCTGAAGTCTTAATTG
CAAATTCAGATAAGGATTAGAAGGGGCTGTATCTCTGTAGACCAAAGGTATTTGCTAATA
CCTGAGATATAAAAGTGGTTAAATTCAATATTTACTAATTTAGGATTTCCACTTTGGATT
TTGATTAAGCTTTTTGGTTGAAAACCCCACATTATTAAGCTGTGATGAGGGAAAAAGCAA
CTCTTTCATAAGCCTCACTTTAACGCTTTATTTCAAATAATTTATTTTGGACCTTCTAAA
G
>uce_353_Ph_gUFAC1083 |uce_353
>uce_101_Ph_gUFAC1083 |uce_101
TTGGGCTTTATTTCCACCTTAAAATCTTTACCTGGCCGTGATCTGTTGTTCCATTACTGG
AGGGCAAAAATGGGAGGAATTGTCTGGGCTAAATTGCAATTAGGCAGCCCTGAGAGAGGC
TGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGT
AGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGAAGAG
GAGAGTTAATTGCATGTTACAGTGAGTGTAATGCCTAGATAACCTTGCATTTAATGCTAT
TCTTAGCCCTGCTGCCAAGACTTCTACAGAGCCTCTCTCTGCAGGAAGTCATTAAAGCTG
TGAGTAGATAATGCAGGCTCAGTGAAACCTAAGTGGCAACAATATA
>uce_171_Ph_gUFAC1083 |uce_171
Do you know how to do it using grep, sed or awk?
Thx
$ awk '/^>/{if(match($0,"Ph_gUFAC1083")){s=1} else s=0}s' file
I used a simple criterion for your request:
If a line starts with >, check whether it contains "Ph_gUFAC1083"; if yes, set s=1, otherwise set s=0.
For a line that doesn't start with >, the value of s is retained.
The final s in the awk command decides whether the line is printed (s=1) or not (s=0).
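The flag logic described above can be sketched with the code passed in as a variable. The function name records_for is hypothetical, and index() is used for a plain substring test, so a code such as Ph_gUFAC1083 would also match a longer code that merely starts with it:

```shell
# Hypothetical helper: print every FASTA record (header plus sequence lines)
# whose header contains the string given in $1. Input is read from stdin.
records_for() {
    awk -v id="$1" '
        /^>/ { keep = (index($0, id) > 0) }   # each header line resets the flag
        keep                                  # print while the flag is set
    '
}

printf '%s\n' \
    '>uce_353_Ph_gUFAC1083 |uce_353' 'TTTAGCCATAGAAATG' \
    '>uce_4300_Ph_gUFAC1139 |uce_4300' 'ATTAAAAATACAATCC' |
    records_for Ph_gUFAC1083
```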
If what you want is every line with Ph_gUFAC1139 plus block of lines after that line until the next line starting with >, then the following awk snippet might do:
$ awk 'BEGIN {RS=ORS=">"} /Ph_gUFAC1139/' file.txt
This uses the > character as a record separator, then simply displays records that contain the text you're interested in.
If you wanted to be able to provide the search string using a variable, you'd do it something like this:
$ val="Ph_gUFAC1139"
$ awk -v s="$val" 'BEGIN {RS=ORS=">"} $0 ~ s' file.txt
UPDATE
A comment mentions that the solution above shows trailing record separators rather than leading ones. You can adapt your output to match your input by reversing this order manually:
awk 'BEGIN { RS=ORS=">" } /Ph_gUFAC1139/ { printf "%s%s",ORS,$0 }' file.txt
Note that in the initial examples, a "match" of the regex would invoke awk's default "action", which is to print the line. The default action is invoked if no action is specified within the script. The code (immediately) above includes an action .. which prints the record, preceded by the separator.
This might work for you (GNU sed):
sed '/^>/h;G;/Ph_gUFAC1083/P;d' file
Store each line beginning with > in the hold space (HS) and then append the HS to every line. If any line contains the string Ph_gUFAC1083, print the first line in the pattern space (PS) and discard everything else.
N.B. the regexp for the match may be amended to /\n.*Ph_gUFAC1083/ if the string match may occur in any line.
This program prints each block that starts at a header line containing Ph_gUFAC1083 and ends at the next header line that does not contain it (Ph_gUFAC1139 in the sample):
awk '
BEGIN { begin = 0 }
{
    # Pass blank lines through unchanged
    if ($0 ~ /^$/) {
        print $0
        next
    }
    if ($0 ~ /Ph_gUFAC1083/) {
        # A line containing Ph_gUFAC1083 opens (or continues) a block
        begin = 1
        print $0
    } else if ($0 ~ /^>/) {
        # Any other header line marks the end of the block
        begin = 0
    } else if (begin == 1) {
        # Sequence line inside a block
        print $0
    }
}' inp.txt

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
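Where gawk's gensub is not available, a portable sketch of the same idea is to split the last field on ; and look for the AF= entry. The function name filter_af is hypothetical, and it assumes the key=value list is the last whitespace-separated field, as in the sample:

```shell
# Hypothetical helper: print lines from stdin whose AF= value is greater
# than the threshold in $1. Uses only POSIX awk features.
filter_af() {
    awk -v min="$1" '{
        n = split($NF, kv, ";")                   # key=value pairs of the last field
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^AF=/ && substr(kv[i], 4) + 0 > min) {
                print
                next
            }
    }'
}

printf '%s\n' \
    '1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV' \
    '4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV' |
    filter_af 0.1
```

Since the entry is found by name, it does not matter where AF= sits in the list.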
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
You could add a + after the second character group to support additional digits (i.e. 0.NNNNN) but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5+0>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
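I would use awk. Since awk supports string (lexicographic) comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields by ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields). Because this is a string comparison, it is only reliable while AF is written in plain decimal form such as 0.NN.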

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the file format above, from which I want to find a matching record. For example, match a number (7789) on a line starting with XYZ and, once matched, look for a matching number (7345) in the lines below that start with 1, until a line starting with 9 is reached, then retrieve the entire line. How can I accomplish this using a shell script, awk, sed, or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disables automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which is coming
# after the previous match
/^9$/ { } # Match line that is 9
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively. The names are arbitrary.) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second to the last (NF is the field count)
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
an extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
set flag s when line starts with XYZ and contains 7789; reset when line is just 9, and print when flag is set and contains pattern 7345.
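A sketch of the same flag approach that also prints the XYZ header, to match the expected output in the question. The function name find_in_block is hypothetical; the header is held back and printed only once a matching line is found in its block:

```shell
# Hypothetical helper: $1 = number to find in the XYZ header,
# $2 = number to find in the lines starting with 1. Reads stdin.
find_in_block() {
    awk -v rid="$1" -v fid="$2" '
        /^XYZ/ { hdr = index($0, rid) ? $0 : ""; shown = 0; next }  # remember a matching header
        /^9$/  { hdr = ""; next }                                   # a lone 9 closes the block
        hdr != "" && /^1/ && index($0, fid) {
            if (!shown) { print hdr; shown = 1 }                    # header once per block
            print
        }
    '
}

printf '%s\n' \
    XYZNA0000778800Z 17345000012300324000000004000000000000000 9 \
    XYZNA0000778900Z 16123000012300321000000008000000000000000 \
    17345000012300324000000004000000000000000 9 |
    find_in_block 7789 7345
```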
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header,this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

AWK between 2 patterns - first occurence

I have this example of an ini file. I need to extract the names between the two patterns Name_Z1 and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this. I would prefer awk to do the job.
I would greatly appreciate any help. Thank you!
For Wintermute solution: The [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A gawk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
gawk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. It is also part of the condition, so x must be non-zero (Z1 already seen) to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after it up to the |. Stores the capture in a. Again part of the condition, so it must succeed to continue.
gsub(";","\n",a[1]) - Substitutes newlines for all the ; in the capture group a[1].
{print a[1];exit} - If all conditions are met then print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
Awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field separator to ;
/Name_Z1/{f++} matches lines containing the pattern Name_Z1; on a match, f is incremented
f==1 && /Names/{print $2,$3,$4} means: if f == 1 and the line matches the pattern Names, print columns 2, 3 and 4 (delimited by ;)
OFS="\n" sets the output field separator to a newline
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in group of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" ' # By setting RS to nothing, one record equals one block. Then FS is set to newline, so each line is a field
/^\[Name_Z1\]/ { # Search for block with [Name_Z1]
n=split($3,a,";") # Split field 3 (the names) and store the number of pieces in n
for (i=2;i<=n;i++) # Loop from the second to the last piece
print a[i] # Print each name
exit # Exit after the first match
}
' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
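A sketch that finds the attribute by name rather than by its position in the line, and that matches the section header exactly so [CAB_Z1] is not confused with [CAB_Z1_FUEGO]. The function name object_refs is hypothetical, and it assumes the values of interest follow "ObjectOfReference;contains;" as in the sample:

```shell
# Hypothetical helper: print the ObjectOfReference values of the
# FilterAttr line in section [$1]. Reads the ini data from stdin.
object_refs() {
    awk -v sec="$1" '
        $0 == ("[" sec "]") { insec = 1; next }   # exact header match
        /^\[/               { insec = 0 }         # any other header leaves the section
        insec && /^FilterAttr=/ {
            sub(/^FilterAttr=/, "")
            n = split($0, attr, "[|]")            # attributes are separated by |
            for (i = 1; i <= n; i++) {
                m = split(attr[i], v, ";")
                if (v[1] == "ObjectOfReference")
                    for (j = 3; j <= m; j++)      # skip the name and the operator
                        print v[j]
            }
            exit
        }
    '
}

printf '%s\n' '[CAB_Z1]' 'READ_ONLY=false' \
    'FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;CL5518|PerceivedSeverity;=;Critical' \
    'NAME=CAB_Z1' |
    object_refs CAB_Z1
```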
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith
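The same state idea can be sketched with the section name passed in as a variable. The function name section_names is hypothetical; the header and the OBJ= line are compared as whole lines, so Name_Z1 does not also select Name_Z1_Phone:

```shell
# Hypothetical helper: print the names listed in section [$1], one per line.
# Reads the ini data from stdin and stops at the first matching section.
section_names() {
    awk -v sec="$1" '
        $0 == ("[" sec "]")         { insec = 1; next }   # section start
        insec && $0 == ("OBJ=" sec) { exit }              # section end, nothing found
        insec && /^Names;/ {
            n = split($0, name, ";")
            for (i = 2; i <= n; i++) print name[i]        # name[1] is the word "Names"
            exit
        }
    '
}

section_names Name_Z1 <<'EOF'
[Name_Z5]
Names;Jesus;Tom;Miguel
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
OBJ=Name_Z1
EOF
```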

Resources