Replace some lines in fasta file with appended text using while loop and if/else statement - bash

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here my best attempt so far.
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if and else loop aren't working at all!
Thanks!

You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"

You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a > The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ will perform a regex test, and ^> means "Starts with >". Making the test: ($1~/^>/)
If so it will set the first field to the output you are looking for (using substr() to get the bits of the string you want. {$1=$1"|population:"substr($1,2,3)}
Finally it will print out the entire record (with the changes if applicable): {}1 which is shorthand for {print $0} or.. print the entire record.

Related

How to get values in a line while looping line by line in a file (shell script)

I have a file which looks like this (file.txt)
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
I want to loop trough this line by line an extract the key values.
so the result should be like ,
AJGUIGIDH568
AJGUIGIDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
So I wrote a code like this to loop line by line and extract the value between {"key":" and ","rule": because key values is in between these 2 patterns.
while read p; do
echo $p | sed -n "/{"key":"/,/","rule":,/p"
done < file.txt
But this is not working. can someone help me to figure out me this. Thanks in advance.
Your sample input is almost valid json. You could tweak it to make it valid and then extract the values with jq with something like:
sed -e 's/squid/"squid/' -e 's/$/"}/' file.txt | jq -r .key
Or, if your actual input really is valid json, then just use jq:
jq -r .key file.txt
If the "random-txt" may include double quotes, making it difficult to massage the input to make it valid json, perhaps you want something like:
awk '{print $4}' FS='"' file.txt
or
sed -n '/{"key":"\([^"]*\).*/s//\1/p' file.txt
or
while IFS=\" read open_brace key colon val _; do echo "$val"; done < file.txt
For the shown data, you can try this awk:
awk -F '"[:,]"' '{print $2}' file
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
With the give example you can simple use
cut -d'"' -f4 file.txt
Assumptions:
there may be other lines in the file so we need to focus on just the lines with "key" and "rule"
the only text between "key" and "rule" is the desired string (eg, squid never shows up between the two patterns of interest)
Adding some additional lines:
$ cat file.txt
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
ignore this line}
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
ignore this line}
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
One sed idea:
$ sed -nE 's/^(.*"key":")([^"]*)(","rule".*)$/\2/p' file.txt
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
Where:
-E - enable extended regex support (and capture groups without need to escape sequences)
-n - suppress printing of pattern space
^(.*"key":") - [1st capture group] everything from start of line up to and including "key":"
([^"]*) - [2nd capture group] everything that is not a double quote (")
(","rule".*)$ - [3rd capture group] everything from ",rule" to end of line
\2/p - replace the line with the contents of the 2nd capture group and print

sed insert line after a match only once [duplicate]

UPDATED:
Using sed, how can I insert (NOT SUBSTITUTE) a new line on only the first match of keyword for each file.
Currently I have the following but this inserts for every line containing Matched Keyword and I want it to only insert the New Inserted Line for only the first match found in the file:
sed -ie '/Matched Keyword/ i\New Inserted Line' *.*
For example:
Myfile.txt:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
changed to:
Line 1
Line 2
Line 3
New Inserted Line
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
You can sort of do this in GNU sed:
sed '0,/Matched Keyword/s//New Inserted Line\n&/'
But it's not portable. Since portability is good, here it is in awk:
awk '/Matched Keyword/ && !x {print "Text line to insert"; x=1} 1' inputFile
Or, if you want to pass a variable to print:
awk -v "var=$var" '/Matched Keyword/ && !x {print var; x=1} 1' inputFile
These both insert the text line before the first occurrence of the keyword, on a line by itself, per your example.
Remember that with both sed and awk, the matched keyword is a regular expression, not just a keyword.
UPDATE:
Since this question is also tagged bash, here's a simple solution that is pure bash and doesn't required sed:
#!/bin/bash
n=0
while read line; do
if [[ "$line" =~ 'Matched Keyword' && $n = 0 ]]; then
echo "New Inserted Line"
n=1
fi
echo "$line"
done
As it stands, this as a pipe. You can easily wrap it in something that acts on files instead.
If you want one with sed*:
sed '0,/Matched Keyword/s//Matched Keyword\nNew Inserted Line/' myfile.txt
*only works with GNU sed
This might work for you:
sed -i -e '/Matched Keyword/{i\New Inserted Line' -e ':a;n;ba}' file
You're nearly there! Just create a loop to read from the Matched Keyword to the end of the file.
After inserting a line, the remainder of the file can be printed out by:
Introducing a loop place holder :a (here a is an arbitrary name).
Print the current line and fetch the next into the pattern space with the ncommand.
Redirect control back using the ba command which is essentially a goto to the a place holder. The end-of-file condition is naturally taken care of by the n command which terminates any further sed commands if it tries to read passed the end-of-file.
With a little help from bash, a true one liner can be achieved:
sed $'/Matched Keyword/{iNew Inserted Line\n:a;n;ba}' file
Alternative:
sed 'x;/./{x;b};x;/Matched Keyword/h;//iNew Inserted Line' file
This uses the Matched Keyword as a flag in the hold space and once it has been set any processing is curtailed by bailing out immediately.
If you want to append a line after first match only, use AWK instead of SED as below
awk '{print} /Matched Keyword/ && !n {print "New Inserted Line"; n++}' myfile.txt
Output:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
New Inserted Line
Line 4
This line contains the Matched Keyword and other stuff
Line 6

Convert multi-line csv to single line using Linux tools

I have a .csv file that contains double quoted multi-line fields. I need to convert the multi-line cell to a single line. It doesn't show in the sample data but I do not know which fields might be multi-line so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't how much data so performance isn't a consideration.
I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.
The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.
I tried something like this and it kind of works on one row but fails on multiple rows.
sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv
CSV example
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
The output I want is
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
etc.
First my apologies for getting here 7 months late...
I came across a problem similar to yours today, with multiple fields with multi-line types. I was glad to find your question but at least for my case I have the complexity that, as more than one field is conflicting, quotes might open, close and open again on the same line... anyway, reading a lot and combining answers from different posts I came up with something like this:
First I count the quotes in a line, to do that, I take out everything but quotes and then use wc:
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
If you think of a single multi-line field, knowing if the quotes are 1 or 2 is enough. In a more generic scenario like mine I have to know if the number of quotes is odd or even to know if the line completes the record or expects more information.
To check for even or odd you can use the mod operand (%), in general:
even % 2 = 0
odd % 2 = 1
For the first line:
Odd means that the line expects more information on the next line.
Even means the line is complete.
For the subsequent lines, I have to know the status of the previous one. for instance in your sample text:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
You can say line 1 (John,Doe,"Country) has 1 quote (odd) what means the status of the record is incomplete or open.
When you go to line 2, there is no quote (even). Nevertheless this does not mean the record is complete, you have to consider the previous status... so for the lines following the first one it will be:
Odd means that record status toggles (incomplete to complete).
Even means that record status remains as the previous line.
What I did was looping line by line while carrying the status of the last line to the next one:
incomplete=0
cat file.csv | while read line; do
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
incomplete=$((($quotes+$incomplete)%2)) # Check if Odd or Even to decide status
if [ $incomplete -eq 1 ]; then
echo -n "$line " >> new.csv # If line is incomplete join with next
else
echo "$line" >> new.csv # If line completes the record finish
fi
done
Once this was executed, a file in your format generates a new.csv like this:
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345
I like one-liners as much as everyone, I wrote that script just for the sake of clarity, you can - arguably - write it in one line like:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...
Recovering the multi-line fields
Every need is different, in my case I wanted the records in one line to further process the csv to add some bash-extracted data, but I would like to keep the csv as it was. To accomplish that, instead of joining the lines with a space I used a code - likely unique - that I could then search and replace:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l";done >new.csv
the code is ~newline~, this is totally arbitrary of course.
Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:
sed -i 's/ ~newline~ /\n/g' new.csv
References:
Ternary operator: https://stackoverflow.com/a/3953666/6316852
Count char occurrences: https://stackoverflow.com/a/41119233/6316852
Other peculiar cases: https://www.linuxquestions.org/questions/programming-9/complex-bash-string-substitution-of-csv-file-with-multiline-data-937179/
TL;DR
Run this:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
... and collect results in new.csv
I hope it helps!
If Perl is your option, please try the following:
perl -e '
while (<>) {
$str .= $_;
}
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
if (($el = $&) =~ /^".*"$/s) {
$el =~ s/^"//s; $el =~ s/"$//s;
$el =~ s/""/"/g;
$el =~ s/\s+(?!$)/ /g;
}
push(#ary, $el);
}
foreach (#ary) {
print /\n$/ ? "$_" : "$_,";
}' sample.csv
sample.csv:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
John,Doe,"Country
City
Street",67890
Result:
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
This might work for you (GNU sed):
sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file
Test each line to see that it contains the correct number of fields (in the example that was 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) by spaces and finally remove the "'s.
N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
Try cat -v file.csv. When the file was made with Excel, you might have some luck: When the newlines in a field are a simple \n and the newline at the end is a \r\n (which will look like ^M), parsing is simple.
# delete all newlines and replace the ^M with a new newline.
tr -d "\n" < file.csv| tr "\r" "\n"
# Above two steps with one command
tr "\n\r" " \n" < file.csv
When you want a space between the joined line, you need an additional step.
tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
EDIT: #sjaak commented this didn't work is his case.
When your broken lines also have ^M you still can be a lucky (wo-)man.
When your broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join 2 lines when the first line has exactly one double quote.
sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
Explanation:
-z don't use \n as line endings
:a label for repeating the step after successful replacement
(\n|^) Search after a newline or the very first line
([^"]*) Substring without a "
ta Go back to label a and repeat
awk pattern matching is working.
answer in one line :
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
if you'd like to drop quotes, you could use:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'
but I prefer to keep it.
to explain the code:
/Pattern/ : find pattern in current line.
ORS : indicates the output line record.
$0 : indicates the whole of the current line.
's/OldPattern/NewPattern/': substitude first OldPattern with NewPattern
/g : does the previous action for all OldPattern
/w : write the result to Newfile

Grep list (file) from another file

Im new to bash and trying to extract a list of patterns from file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is smth like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so i don`t know to what pattern belongs each string and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a for loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >>output.a[j]
next }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that separates (with split, comma-separatd with quotes and spaces stripped off) words from file2 to an array (word[]) and stores the record names (line 1 etc.) to it comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ( "do" expected) or misbehavior ( gets names of pattern blocks, eg ABC, BDF, but no lines.
Gave up for a while and then eventually tried another way
While base goal were to cycle through pattern list files, search for patterns in huge file and write out specific columns from lines found - i simply wrote
for *i in *txt # cycle throughfiles w/ patterns
do
grep -F -f "$i" bigfile.csv >> ${i}.out1 #greps all patterns from current file
cut -f 2,3,4,7 ${i}.out1>> ${i}.out2 # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is, hope it`ll help somebody in similar situation. You can easily add some echoes to write out pattern list names as i initially requested

appending text to specific line in file bash

So I have a file that contains some lines of text separated by ','. I want to create a script that counts how much parts a line has and if the line contains 16 parts i want to add a new one. So far its working great. The only thing that is not working is appending the ',' at the end. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
if [[ $p == "HEA"* ]]
then
IFS=',' read -ra ADDR <<< "$p"
echo ${#ADDR[#]}
arrayCount=${#ADDR[#]}
if [ "${arrayCount}" -eq 16 ];
then
sed -i "/$p/ s/\$/,xx/g" $f
fi
fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What im doing wrong? I'm sure its something small but i cant find it..
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets input field separator as comma
NF==16 is the condition that says execute block inside { and } if # of fields is 16
$0 = $0 FS "xx" appends xx at end of line
1 is the default awk action that means print the output
For using sed answer should be in the following:
Use ${line_number} s/..../..../ format - to target a specific line, you need to find out the line number first.
Use the special char & to denote the matched string
The sed statement should look like the following:
sed -i "${line_number}s/.*/&xx/"
I would prefer to leave it to you to play around with it but if you would prefer i can give you a full working sample.

Resources