GNU sed 4.2.1 matching second occurrence - Windows

With the command below I am trying to insert M09 between two patterns: an M502 line and the second line after that M502 which looks like X41.5Y251.5T201.
Specifically, I am trying to insert M09 just before that last line.
sed -i "/M502/{:a;N;/\(.*\)T\([0-9]\+\)/!ba;N;s/\(.*\)\n\(.*\)T\([0-9]\+\)/\1\nM09\n\2T\3/}" *.nc
Result:
M502
M09
X1287.2Y353.28T324
..........
G27X-310.27
X41.5Y251.5T201
Should be:
M502
X1287.2Y353.28T324
..........
G27X-310.27
M09
X41.5Y251.5T201
I am trying to match the last line with \(.*\)T\([0-9]\+\), but that picks up the second line. How do I make sed ignore the first match of this expression?
I stress that I could have an undetermined number of such text blocks, and I need to insert M09 for every one.

With your shown samples, please try the following awk code.
awk '
/.*T[0-9][0-9]*/{ found=1 }        # note that a ...T<digits> line has been seen
FNR==1          { prev=$0; next }  # buffer the first line
{
    print prev                     # print lines one step behind, via the buffer
    prev=$0
}
END{
    if(found){ print "M09" }       # emit M09 just before the buffered last line
    print prev
}
' Input_file
The above code only prints the result to the terminal; if you want to save the output into Input_file itself, append > temp && mv temp Input_file to the command.
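For example, a condensed one-line form of the same program with that redirect appended (my own compaction of the answer's code; Input_file stands for your real file name) would be:
awk '/.*T[0-9][0-9]*/{found=1} FNR==1{prev=$0; next} {print prev; prev=$0} END{if(found) print "M09"; print prev}' Input_file > temp && mv temp Input_file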

How do I match the second instance of \(.*\)T\([0-9]\+\) ? It must be an expression. Thanks for the help.
Assuming that you would like to insert some text just before the line that matches the above regex for the 2nd time, you should consider using awk, where you could do something like this:
$ awk '/.*T[0-9][0-9]*/ { n++ } !p && n == 2 { print "M09"; p = 1 } 1' input.txt
M502
X1287.2Y353.28T324
..........
G27X-310.27
M09
X41.5Y251.5T201
The breakdown of the above command is:
/.*T[0-9][0-9]*/ { n++ }             # Every time the regex matches, add one to `n`
!p && n == 2 { print "M09"; p = 1 }  # If n == 2 and we haven't printed the M09 marker yet,
                                     # then print "M09" (once)
1                                    # Print every line
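If you later need the N-th match rather than specifically the 2nd, the same idea can be parameterised with an awk variable (a sketch; the variable name want is my own choice, not part of the original answer):
awk -v want=2 '/.*T[0-9][0-9]*/ { n++ } !p && n == want { print "M09"; p = 1 } 1' input.txt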

Related

How can I calculate the number of occurrences that are followed by a specific value? (add if statement)

How can I count occurrences of the second value ONLY in lines whose first field starts with E (e.g. 'EXXXX')?
file.txt:
E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086
I tried:
awk -F, '{map[$2]++} END { for (key in map) { print key, map[key] } }' file.txt
and add:
if [[ $line2 == `E*` ]];then
but it is not working; I get a syntax error.
Expected Output:
total no of occurrences:
Rv0761: 2
Rv3086: 1
Right now I can only count the total number of occurrences of the second value.
if [[ $line2 == `E*` ]];then
This is definitely not a legal GNU AWK if statement; consult the If Statement documentation to find what is allowed. An if is not required in this case anyway, as you can do the following. Let file.txt content be
E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086
then
awk 'BEGIN{FS=","}($1~/^E/){map[$2]++} END { for (key in map) { print key, map[key] } }' file.txt
gives output
Rv3086 1
Rv0761 2
Explanation: actions (enclosed in {...}) can be preceded by a pattern, which restricts their execution to lines that match that pattern (in other words, lines for which the condition holds). In the above example the pattern is $1~/^E/, which means the 1st column starts with E.
(tested in gawk 4.2.1)
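As a minimal illustration of the pattern idea on the same data (my own example): a pattern with no action at all uses the default action, which is to print the line, so this prints only the rows whose first field starts with E:
awk -F, '$1 ~ /^E/' file.txt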
You are so close. You are only missing the REGEX to identify records beginning with 'E', and a ":" concatenated on output, to produce your desired results (not in sorted order). For example you can do:
awk -F, '/^E/{map[$2]++} END { for (key in map) { print key ":", map[key] } }' file.txt
Example Output
With your data in file.txt you would get:
Rv3086: 1
Rv0761: 2
If you need the output sorted in some way, just pipe the output of the awk command to sort with whatever option you need.
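For example, to order the results by count, highest first (a sketch; it assumes the "key: count" output format produced above and uses standard sort options):
awk -F, '/^E/{map[$2]++} END { for (key in map) { print key ":", map[key] } }' file.txt | sort -t: -k2,2nr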

AWK Script: Matching Two Files with a Unique Identifier and appending to the record if already matched

I'm trying to compare two files, using a field as the unique identifier to match on.
File 1 has the account number, which is compared against the second file.
If the account number is in both files, the next step is a condition that matches the value and appends a flag to the original record.
Sample file 1:
ACCT1,PHONE1,TEST1
ACCT2,PHONE2,TEST3
Sample file 2:
ACCT1,SOMETHING1
ACCT1,SOMETHING3
ACCT1,SOMETHING1
ACCT1,SOMETHING3
ACCT2,SOMETHING1
ACCT2,SOMETHING3
ACCT2,SOMETHING1
ACCT2,SOMETHING1
The problem is that my awk always takes the last occurrence in the file, even when there is already a match before the end of the records.
Actual output, based on the condition in the code below:
ACCT1,PHONE1,TEST1,000
ACCT2,PHONE2,TEST3,001
Expected Output:
ACCT1,PHONE1,TEST1,001
ACCT2,PHONE2,TEST3,001
Code I'm trying:
awk -f test.awk pass=0 samplefile2.txt pass=1 samplefile1.txt > output.txt
BEGIN{
}
pass==0{
    FS=","
    ACT=$1
    RES1[ACT]=$2
}
pass==1{
    ACCTNO=$1
    PHNO=$2
    FIELD3=$3
    LVCODE=RES1[ACCTNO]
    if(LVCODE=="SOMETHING1"){ OTHERFLAG="001" }
    else if(LVCODE=="SOMETHING4"){ OTHERFLAG="002" }
    else{ OTHERFLAG="000" }
    printf("%s\,", ACCTNO)
    printf("%s\,", PHNO)
    printf("%s\,", FIELD3)
    printf("%s", OTHERFLAG)
    printf "\n"
}
I tried to loop over the variable that holds the array, but unfortunately it turned into an infinite loop when I ran it.
You may use this awk command:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
map[$1] = $0
next
}
$1 in map {
print map[$1], ($2 == "SOMETHING1" ? "001" : ($2 == "SOMETHING4" ? "002" : "000"))
delete map[$1]
}' file1 file2
ACCT1,PHONE1,TEST1,001
ACCT2,PHONE2,TEST3,001
Once we print a matching record from file2, we delete that record from the associative array map to ensure only the first matching record is evaluated.
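If you also want the file1 accounts that never appear in file2 to be emitted with the default "000" flag (an assumption on my part; the expected output above only shows matched accounts), one possible extension is an END block over the leftover map entries:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
    map[$1] = $0
    next
}
$1 in map {
    print map[$1], ($2 == "SOMETHING1" ? "001" : ($2 == "SOMETHING4" ? "002" : "000"))
    delete map[$1]
}
END {
    # whatever is left in map never matched any file2 line
    for (acct in map) print map[acct], "000"
}' file1 file2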
It sounds like you want to know the first occurrence of ACCTx in samplefile2.txt where SOMETHING1 or SOMETHING4 is present. I think you should read samplefile1.txt into a data structure first, and then iterate line by line through samplefile2.txt looking for your criteria.
BEGIN {
    FS=","
    while (getline < ACCOUNTFILE) accounts[$1]=$0
}
{ OTHERFLAG = "" }
$2 == "SOMETHING1" { OTHERFLAG="001" }
$2 == "SOMETHING4" { OTHERFLAG="002" }
($1 in accounts) && OTHERFLAG!="" {
    print(accounts[$1] "," OTHERFLAG)
    # delete the account so that it does not print again.
    # Only the first occurrence in samplefile2.txt will matter.
    delete accounts[$1]
}
END {
    # Print remaining accounts that did not match above
    for (acct in accounts) print(accounts[acct] ",000")
}
Run the above with:
awk -v ACCOUNTFILE=samplefile1.txt -f test.awk samplefile2.txt
I am not sure what you want to do if both SOMETHING1 and SOMETHING4 appear in samplefile2.txt for the same ACCT1. If you want 'precedence', so that SOMETHING4 overrules SOMETHING1 when it comes later, you will need additional logic. In that case you probably want to avoid the delete, keep updating the accounts[$1] array until you reach the end of the file, and then print all the accounts at the END.
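A minimal sketch of that precedence variant (my own illustration of the idea just described, reusing the ACCOUNTFILE convention; the last SOMETHING1/SOMETHING4 seen for an account wins, and everything is printed at the END):
BEGIN {
    FS=","
    while (getline < ACCOUNTFILE) { accounts[$1]=$0; flag[$1]="000" }
}
($1 in accounts) && $2 == "SOMETHING1" { flag[$1]="001" }   # later matches overwrite earlier ones
($1 in accounts) && $2 == "SOMETHING4" { flag[$1]="002" }
END {
    for (acct in accounts) print accounts[acct] "," flag[acct]
}
Run it the same way: awk -v ACCOUNTFILE=samplefile1.txt -f test.awk samplefile2.txt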

awk: get the next line

I'm trying to use awk to format a file that contains multiple lines.
Contents of the file:
ABC;0;1
ABC;0;0;10
ABC;0;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12
KLM;6;18;1200
KLM;10;18;14
KLM;1;18;15
Desired result:
ABC;0;1;10;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12;1200;14;15
I am using the code below:
awk -F ";" '{
ligne= ligne $0
ma_var = $1
{getline
if($1 != ma_var){
ligne= ligne "\n" $0
}
else {
ligne= ligne";"NF
}
}
}
END {
print ligne
} ' ${FILE_IN} > ${FILE_OUT}
The objective is to compare the first column of the next line to the first column of the current line; if it matches, then add the last column of the next line to the current line and delete the next line, else print the next line.
Kind regards,
As with life, it's a lot easier to make decisions based on what has happened (the previous line) than on what will happen (the next line). Re-state your requirements as "the objective is to compare the first column of the current line to the first column of the previous line; if it matches, then add the last column of the current line to the previous line and delete the current line, else print the current line" and the code to implement it becomes relatively straightforward:
$ cat tst.awk
BEGIN { FS=OFS=";" }
$1 == p1 { prev = prev OFS $NF; next }
{ if (NR>1) print prev; prev=$0; p1=$1 }
END { print prev }
$ awk -f tst.awk file
ABC;0;1;10;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12;1200;14;15
If you're ever tempted to use getline again, be sure you fully understand everything discussed at http://awk.freeshell.org/AllAboutGetline before making a decision.
I would take a slightly different approach than Ed:
$ awk '$1 == p { printf ";%s", $NF; next } NR > 1 { print "" } {p=$1;
printf "%s" , $0} END{print ""}' FS=\; input
At each line, check if the first column matches the previous one. If it does, just print a ";" and the last field. If it doesn't, terminate the previous output line and print the whole new line with no trailing newline.

Sequence length of FASTA file

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to get the total number of sequences and the total length. Any suggestion will be welcome... In bash or awk, please. I know it is easy to do in Perl/BioPerl, and in fact I already have a script that does it that way.
An awk / gawk solution can be composed of three stages:
Every time a header is found, these actions should be performed:
Print previous seqlen if exists.
Print tag.
Initialize seqlen.
For the sequence lines we just need to accumulate totals.
Finally at the END stage we print the remnant seqlen.
Commented code:
awk '/^>/ { # header pattern detected
        if (seqlen){
            # print previous seqlen if exists
            print seqlen
        }
        # print the tag
        print
        # initialize sequence
        seqlen = 0
        # skip further processing
        next
    }
    # accumulate sequence length
    {
        seqlen += length($0)
    }
    # remnant seqlen if exists
    END{ if(seqlen){ print seqlen } }' file.fa
A one-liner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ {
        if (seqlen) {
            print seqlen
        }
        print
        seqtotal+=seqlen
        seqlen=0
        seq+=1
        next
    }
    {
        seqlen += length($0)
    }
    END{
        print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa
A quick way, with any awk, would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk; it is an adapted version of awk which is tuned to process FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
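Since the question also asks for the total number of sequences and total length, bioawk can accumulate those in an END block just like plain awk; a sketch, assuming the same -c fastx mode (where $name and $seq are the parsed record fields):
bioawk -c fastx '{ print ">" $name ORS length($seq); n++; total += length($seq) }
    END { print "# " n " sequences, total length " total "." }' file.fasta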
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
    OFS = "\t";  # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
    if (seqlen) {
        # Only print info for this sequence if no target was given
        # or its id matches the target.
        if (! target || id == target) {
            print id, seqlen;
        }
    }
    # Get sequence id:
    # 1. Split header on whitespace (fields[1] is now ">id")
    split($0, fields);
    # 2. Get portion of first field after the starting ">"
    id = substr(fields[1], 2);
    seqlen = 0;
    next;
}
{
    seqlen = seqlen + length($0);
}
END {
    if (! target || id == target) {
        print id, seqlen;
    }
}
"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa
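If you also want the per-file totals (number of sequences and summed length), seqkit's stats subcommand reports those, assuming it is available in your seqkit version:
seqkit stats sequence.fa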

match a pattern and print nth line if condition matches

My requirement is something like this:
Read File:
If ( line contains /String1/)
{
Increment cursor by 10 lines;
If (line contains /String2/ )
{ print line; }
}
So far I have got:
awk '/String1/{nr[NR]; nr[NR+10]}; NR in nr' file1.log
The result of this should be the input to:
awk 'match ($0 , /String2/) {print $0}' file1.log
How can I achieve it? Is there a better way?
Thanks.
You are close; you need to set the value of the array element.
awk '/String1/ { linematch[NR+10]=1; } /String2/ && NR in linematch;' file1.log
Each time a line matches String1, you save the record (line) number plus 10. Any time you match String2, check if the current line number is one we are expecting, and if so, print the line.
Here's another way to describe your algorithm. Instead of:
If ( line contains /String1/)
{
Increment cursor by 10 lines;
If (line contains /String2/ )
{ print line; }
}
which would require jumping ahead in your input stream somehow, think of it as:
If ( line contains /String2/)
{
If (line 10 lines previously contained /String1/ )
{ print line; }
}
which just requires you to re-visit what you already read in:
awk '/String1/{f[NR]} /String2/ && (NR-10) in f' file
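For readability, the same one-liner can be written out as a commented script (just an expanded form of the answer above, saved for example as a hypothetical tenafter.awk):
/String1/ { f[NR] }           # remember each line number where String1 appeared
/String2/ && (NR-10) in f     # default action: print if String2 is exactly 10 lines after one of them
Run with: awk -f tenafter.awk file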
