Find and delete a word followed by the next (n) lines in a file [duplicate] - shell

I have one file called unclassified. A sample of it looks like this (each is on a new line):
OTU3
OTU9
OTU10
OTU1
OTU6
OTU4
I have another file called OTUcounts. A sample of it looks like this
>OTU4
TACGTACGTAGCTAGTCGATCGTAGTGCTCGTCATCGTGCTGCTGCTAGCTAGCTAGCTCGTCGTACGTACGTACGTCGTAGTACGCTGCATGCATGCATCGTACGTACGTACGCTAGTCGACTGACTAGCTGACTAGCTAGCTAGCTAGCTAGCTACGTACGATCGTACGTACGTACGTAGCTAGCTACGTAGCTAGCTAGTAGCTAGCTACGTACGTCGTCGTGTCGTCGTTTGT
>OTU6
AACGGCTAGCTAGCTAGCTGCTCTACGTCGATCATCGATGTCAGACTGCGGCAGACTCGTACGTACGTCGTCAGTCGCATCATCAGTCAGTAGACTGCTAGCTCAGATCCGCATCGATCAGTCGACTGCATGCATCAGTCAGCTAGCATCAGTCAGTACGCTAGACTAGTAAGGGGGGGGGCGATGATCGTCGTGCTTATTAGTAGTTTGACCGCGGCGCGCGCGAGACTAGTCGTA
How would I search the OTUcounts file and delete the OTUs listed in the unclassified file, to ultimately end up with a new file that looks like OTUcounts but with the unclassified entries removed?
I have started to use:
grep -x -f unclassified OTUcounts > newOTUcounts
but I know it needs more added - I am fairly new to this.
Any ideas?

You could use awk and store the OTU names from unclassified in an array. When OTUcounts is read, test whether the header line (the first field, including its leading >) is present in the array. If it is, set a flag and skip the following lines until the next header is found, then reset the flag.
awk '
NR==FNR{a[">" $1]; next}
$1 in a{skip=1; next}
skip{if (/^>/){skip=0; print} next}
1
' unclassified OTUcounts > newOTUcounts
Explanation:
awk '
NR==FNR{ # if this is the first file...
a[">" $1] # save the first field, prefixed with `>`, as an index of array `a`
next # continue with the next line
}
$1 in a{ # if the header is present in array `a`
skip=1 # set a flag to skip the following lines
next # continue with the next line
}
skip{ # if the flag is set
if (/^>/){ # if this is the next header
skip=0 # reset the flag
print # print the current line
}
next # continue with the next line
}
1 # print the current line
' unclassified OTUcounts > newOTUcounts

Try using the -v option of grep:
grep -v -f unclassified OTUcounts > newOTUcounts
Note that on its own this removes only the matching header lines, not the sequence lines that follow them, and without -w a pattern such as OTU1 will also match OTU10.
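A slightly safer variant (just a sketch) prefixes each name with > and matches whole lines only, which avoids the OTU1/OTU10 problem, though the sequence line after each removed header would still be left behind:
grep -v -x -f <(sed 's/^/>/' unclassified) OTUcounts > newOTUcounts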

We can indeed do it with only grep, by first generating the list of OTU headers to be kept and then using the --after-context option (-A) to print each kept header together with the lines that follow it (two here; adjust -A2 to the number of sequence lines per record); finally we have to remove the group separator lines (--) which grep places between contiguous groups of matches.
grep OTU OTUcounts | grep -vwf unclassified | grep -xf - -A2 OTUcounts | grep -ve '--' > newOTUcounts

An alternative approach that uses GNU sed (and a shell like bash or zsh that understands <(command) process substitution):
gsed -f <(while read otu; do echo "/^>${otu}\$/,+2d"; done < unclassified) OTUcounts > newOTUcounts
It turns each line of the unclassified file into a sed command that deletes any occurrence of that OTU's header and the next two lines - OTU3, for example, is transformed into /^>OTU3$/,+2d


Using awk command to compare values on separate lines?

I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file, line-by-line and determine if:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
If this is true, I want to add the values from fields 1 and 3 from the line and then field 4 from the next line to a file. For context, after sorting, the input file looks like:
molecule gene start end strand
ERR2661861.3269 JN051170.1 11330 10778 minus
ERR2661861.3269 JN051170.1 11904 11348 minus
ERR2661861.3269 JN051170.1 12418 11916 minus
ERR2661861.3269 JN051170.1 13000 12469 minus
ERR2661861.3269 JN051170.1 13382 13932 plus
ERR2661861.3269 JN051170.1 13977 14480 plus
ERR2661861.3269 JN051170.1 14491 15054 plus
ERR2661861.3269 JN051170.1 15068 15624 plus
ERR2661861.3269 JN051170.1 15635 16181 plus
Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:
ERR2661861.3269 13000 13382
The script that I have thus far is:
# test input file
file=Eg2.1.txt.out
#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file
# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt
# for each line of the input file, do this
for i in $sorted_file
do
# check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
if [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
# if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
then
mol=awk '{if(NR==i) print $1}'
start=awk '{if(NR==i) print $3}'
end=awk '{if(NR==i+1) print $4}'
new_line=$mol$'\t'$start$'\t'$end
echo new_line >> Test_file.txt
fi
done
The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?
Many thanks in advance!
Explanation why your code does not work
For a better solution to your problem see karakfa's answer.
String comparison in bash needs spaces around [ and ]
Bash interprets your command ...
[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']
... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, therefore this should fail with an error message. Add a space after [ and before ].
awk wasn't executed
[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs use [ "$(awk)" = "$(awk)" ].
awk is missing the input file
awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file
awk '... NR==i ...' is not referencing the i from bash's for i in
awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case, see the next paragraph.
for i in $sorted_file is not iterating the file sorted_file
You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file content word by word, not over the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)).
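Putting all four fixes together, a corrected version of your loop might look like the sketch below. It stays deliberately close to your original, and therefore still re-reads the file many times; karakfa's single-pass awk is the better solution:
file=Eg2.1.txt.out
sort -k1,1 -k3n "$file" > sorted_file
echo molecule$'\t'start$'\t'end > Test_file.txt
for i in $(seq $(wc -l < sorted_file))
do
    if [ "$(awk -v i="$i" 'NR==i{print $1}' sorted_file)" = "$(awk -v i="$i" 'NR==i+1{print $1}' sorted_file)" ] &&
       [ "$(awk -v i="$i" 'NR==i{print $5}' sorted_file)" = "minus" ] &&
       [ "$(awk -v i="$i" 'NR==i+1{print $5}' sorted_file)" = "plus" ]
    then
        mol=$(awk -v i="$i" 'NR==i{print $1}' sorted_file)
        start=$(awk -v i="$i" 'NR==i{print $3}' sorted_file)
        end=$(awk -v i="$i" 'NR==i+1{print $4}' sorted_file)
        printf '%s\t%s\t%s\n' "$mol" "$start" "$end" >> Test_file.txt
    fi
done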
This will do the last step; it assumes the data is sorted on the key and that "minus" comes before "plus".
$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile
ERR2661861.3269 13000 13382
Note that awk has an implicit loop; there is no need to force it to iterate externally.
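Spread out with comments (same logic, just reformatted), the one-liner reads:
awk '
NR==1 { next }                      # skip the header line
$1==p && f && $NF=="plus" {         # same molecule, previous line was minus,
    print p, v, $3                  # current is plus: print molecule,
}                                   # previous start, current start
{ p=$1; v=$3; f=($NF=="minus") }    # remember this line for the next round
' sortedfile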
The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of each line and then compare as soon as both lines have been read, as in this awk script:
awk '{
molecule = $1
strand = $5
if (molecule==last_molecule)
    if (last_strand=="minus")
        if (strand=="plus")
            print $1, end, $4
last_molecule = molecule
last_strand = strand
end = $3
}' sorted_file
You essentially described a proto-program in your bullet points:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
You have everything needed to write a program in Perl, awk, ruby, etc.
Here is Perl version:
perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
$l0=$F[0]; $l4=$F[4];' sorted_file
The -lanE switches enable autosplit into @F (-a, like awk), an implicit loop over the input lines (-n), automatic line-ending handling (-l), and compile the next argument as the program (-E);
The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points (eq is Perl's string comparison, and Perl arrays are 0-indexed, so 'first' is 0 and the fifth is 4)
The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and Perl allow comparisons against not-yet-existing variables, which is why $l0 and $l4 can be used in a comparison before being set on the first pass. In most other languages, such as Ruby, they need to be initialized first, hence the BEGIN block below...)
Here is an awk version, same program essentially:
awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
{l1=$1;l5=$5}' sorted_file
Ruby version:
ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file
All three print:
ERR2661861.3269 JN051170.1 13382
My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then need are the idiomatic details of each language.

Grep list (file) from another file

I'm new to bash and am trying to search for a list of patterns taken from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
The code I tried was:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then some list of strings without the pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and I have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a while loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >> ("output." a[j])
next }
}' File1.txt base.csv
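If you prefer to stay with grep, the anchoring concern can be addressed with a whole-word match (a sketch):
grep -wf File1.txt base.csv > output.txt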
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that splits (with split; comma-separated, with quotes and spaces stripped off) the words of file2 into an array (word[]) and stores against each word a comma-separated list of the record names (line 1 etc.) it occurs in:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ("do" expected) or misbehavior (I got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
Gave up for a while and then eventually tried another way.
Since the base goal was to cycle through pattern-list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *.txt # cycle through files w/ patterns
do
grep -F -f "$i" bigfile.csv >> "${i}.out1" # greps all patterns from current file
cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern-list names as I initially requested.
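For what it's worth, the two steps can be collapsed into a single pipeline per pattern file, skipping the intermediate .out1 file (a sketch using the same filenames):
for i in *.txt
do
    grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 >> "${i}.out2"
done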

Update version number in property file using bash

I am new to bash scripting and I need help with awk. The thing is that I have a property file with a version inside and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting the strange result version=1.1.1.0"""...1
Could someone please help me with this?
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
You can use = as the delimiter to set the version to a value passed in from the shell, like this (note that only the matching line is printed):
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v}' file.properties
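Alternatively, to keep auto-incrementing the last component, the dot itself can serve as the delimiter; a minimal sketch, assuming the file contains a plain version=1.1.1.0 line:
file="version.properties"
awk -F. -v OFS=. '/^version=/{$NF++} 1' "$file" > newFile && mv newFile "$file"
Here -F. splits version=1.1.1.0 into version=1, 1, 1 and 0, $NF++ increments the last field, and the assignment rebuilds the line with . as the output separator.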

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here my best attempt so far.
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if and else branches aren't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
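And if you do want to stick with the while loop, quoting the expansions and giving printf a proper format string makes your original approach work; a sketch:
filename="testfasta.fa"
while read -r line
do
    if [[ "$line" == ">"* ]]; then
        id=${line:1:3} # the first three characters after the >
        printf '%s|population:%s\n' "$line" "$id"
    else
        printf '%s\n' "$line"
    fi
done < "$filename" > outfile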
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a >. The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ performs a regex test, and ^> means "starts with >", making the test: ($1~/^>/)
If so, it will set the first field to the output you are looking for, using substr() to get the bits of the string you want: {$1=$1"|population:"substr($1,2,3)}
Finally, it will print out the entire record (with the changes if applicable): the trailing 1 is shorthand for {print $0}, i.e. print the entire record (the {} before it is just an empty action).

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to find matching records. For example, match a number (7789) on the line starting with XYZ and, once matched, look for a matching number (7345) in the lines below starting with 1, until a line starting with 9 is reached; then retrieve that entire line. How can I accomplish this using shell script, awk, sed or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' '          # -n disables automatic printing
/^XYZ.*7789/,       # Start of the range: a line starting
                    # with XYZ and containing 7789
/^9$/               # End of the range: a line that is just 9
{ /^1.*7345/p }     # Within that range, print lines starting
                    # with 1 and containing 7345
range { stuff } will execute stuff when it's inside range; in this case the range starts at /^XYZ.*7789/ and ends with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, you can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do, as sketched below).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
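Driven from shell variables, the call might then look like this sketch (record_id and field_id are hypothetical variable names):
record_id=7789
field_id=7345
awk -v rid="$record_id" -v fid="$field_id" -v RS='\n9\n' -F '\n' '
index($1, rid) {
    for(i = 2; i <= NF; ++i)
        if(index($i, fid)) { print $i; next }
}' filename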
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain whether BSD awk accepts them. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
an extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
set flag s when a line starts with XYZ and contains 7789; reset it when a line is just 9; print when the flag is set and the line contains the pattern 7345.
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
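Broken down (the same command, with each step commented on its own line):
sed -n '
# a line starting with XYZ begins a new record in the hold space
/^XYZ/h
# any other line is appended to the hold space (the empty // reuses /^XYZ/)
//!H
# if this is not the terminating 9 line, start the next cycle
/^9/!b
# terminator reached: swap the whole record into the pattern space
x
# discard records that do not have 7789 in the header
/^XYZ[^\n]*7789/!b
# print records that contain 7345
/7345/p
' file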
If the 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file
