Transform multiple files in a directory in unix - shell

I have a folder named Translated_cds.
This folder contains 52 text files. They are FASTA files that hold information about proteins, for example:
>lcl|NZ_JPMI01000003.1_prot_WP_043388330.1_1 [locus_tag=Q664_RS00010] [protein=HAMP domain-containing protein] [protein_id=WP_043388330.1] [location=complement(30..1904)] [gbkey=CDS]
MRIRTRLLLLLIVTAAVPTLAVGLLAWRDAERALSEAVAEQHRRTALAEAEHAATHVLSLATELGGALVHQEPLELGPSE
AQEFLIRVFLRRDRIAQVGLFDARGQLTASVFVDDPEAFARQEPQFRRHDTVAAGEVEDFQRRASELLSQVPEGRAYAIS
APYLTGVRRRPAVVVAARAPGTRTGGLAAELGLEELSQRLAARGVGDERVFLLDGAGRLLLDGEPERERHEDFTGKLPGA
VGARQTGLAAYEEEGRAWLAAYSPVPELGWVAVVARPREAALAPLHALARSTYGVLGLTLLGVLALALMLARALARPIAR
LAEGARALARGNLAHRISLKRRDELGDLARAFNDMGQALEQAHRELLGFNEQLAAQVEERTRELQQTQVQLSRSQRLAAM
GDLAAGMAHEMNNPLAAVLGNVQLMLMDLPKEDPSHRMLGTVHQQAQRIASIVRELQLLSERQQLGRLPLDLHRMLQRVL
ESRCAELSQVGVHVDCRFHPGEVKVLGDTQALGDVLGRLLGNALNAMRDRPERNLVLSTQVVDAEVVRVEMKDTGRGIAR
EHLERIFNPFFTTKQQWTGKGLSLAVCHRVIEDHGGTITLDSVEGVGTTVTLVLPAAPASSGLV
The line starting with > (called the header) is present in all the files. I want to replace the spaces ' ' in the headers with _.
So far I have tried this:
sed -i 's/ /_/g' Translated_cds*

We can prefix the substitution with the address /^>/ so that it applies only to the header lines we are interested in:
sed -i -e '/^>/ s/ /_/g' Translated_cds*
My test:
echo '>lcl|NZ_JPMI01000003.1_prot_WP_043388330.1_1 [locus_tag=Q664_RS00010] [protein=HAMP domain-containing protein] [protein_id=WP_043388330.1] [location=complement(30..1904)] [gbkey=CDS]
MRIRTRLLLLLIVTAAVPTLAVGLLAWRDAERALSEAVAEQHRRTALAEAEHAATHVLSLATELGGALVHQEPLELGPSE
AQEFLIRVFLRRDRIAQVGLFDARGQLTASVFVDDPEAFARQEPQFRRHDTVAAGEVEDFQRRASELLSQVPEGRAYAIS
APYLTGVRRRPAVVVAARAPGTRTGGLAAELGLEELSQRLAARGVGDERVFLLDGAGRLLLDGEPERERHEDFTGKLPGA
VGARQTGLAAYEEEGRAWLAAYSPVPELGWVAVVARPREAALAPLHALARSTYGVLGLTLLGVLALALMLARALARPIAR
LAEGARALARGNLAHRISLKRRDELGDLARAFNDMGQALEQAHRELLGFNEQLAAQVEERTRELQQTQVQLSRSQRLAAM
GDLAAGMAHEMNNPLAAVLGNVQLMLMDLPKEDPSHRMLGTVHQQAQRIASIVRELQLLSERQQLGRLPLDLHRMLQRVL
ESRCAELSQVGVHVDCRFHPGEVKVLGDTQALGDVLGRLLGNALNAMRDRPERNLVLSTQVVDAEVVRVEMKDTGRGIAR
EHLERIFNPFFTTKQQWTGKGLSLAVCHRVIEDHGGTITLDSVEGVGTTVTLVLPAAPASSGLV' | sed -e '/^>/ s/ /_/g'
My result:
>lcl|NZ_JPMI01000003.1_prot_WP_043388330.1_1_[locus_tag=Q664_RS00010]_[protein=HAMP_domain-containing_protein]_[protein_id=WP_043388330.1]_[location=complement(30..1904)]_[gbkey=CDS]
MRIRTRLLLLLIVTAAVPTLAVGLLAWRDAERALSEAVAEQHRRTALAEAEHAATHVLSLATELGGALVHQEPLELGPSE
AQEFLIRVFLRRDRIAQVGLFDARGQLTASVFVDDPEAFARQEPQFRRHDTVAAGEVEDFQRRASELLSQVPEGRAYAIS
APYLTGVRRRPAVVVAARAPGTRTGGLAAELGLEELSQRLAARGVGDERVFLLDGAGRLLLDGEPERERHEDFTGKLPGA
VGARQTGLAAYEEEGRAWLAAYSPVPELGWVAVVARPREAALAPLHALARSTYGVLGLTLLGVLALALMLARALARPIAR
LAEGARALARGNLAHRISLKRRDELGDLARAFNDMGQALEQAHRELLGFNEQLAAQVEERTRELQQTQVQLSRSQRLAAM
GDLAAGMAHEMNNPLAAVLGNVQLMLMDLPKEDPSHRMLGTVHQQAQRIASIVRELQLLSERQQLGRLPLDLHRMLQRVL
ESRCAELSQVGVHVDCRFHPGEVKVLGDTQALGDVLGRLLGNALNAMRDRPERNLVLSTQVVDAEVVRVEMKDTGRGIAR
EHLERIFNPFFTTKQQWTGKGLSLAVCHRVIEDHGGTITLDSVEGVGTTVTLVLPAAPASSGLV
If we want only the spaces inside the bracketed key=value tags of the header replaced, then:
sed -i -e '/^>/ s/\([A-Za-z0-9]\) \([A-Za-z0-9]\)/\1_\2/g' Translated_cds*
Or, a little more readably, with extended regular expressions and a character class:
sed -i -E '/^>/ s/([[:alnum:]]) ([[:alnum:]])/\1_\2/g' Translated_cds*
The result will change only inside the header's keyword/value tags:
>lcl|NZ_JPMI01000003.1_prot_WP_043388330.1_1 [locus_tag=Q664_RS00010] [protein=HAMP_domain-containing_protein] [protein_id=WP_043388330.1] [location=complement(30..1904)] [gbkey=CDS]
MRIRTRLLLLLIVTAAVPTLAVGLLAWRDAERALSEAVAEQHRRTALAEAEHAATHVLSLATELGGALVHQEPLELGPSE
AQEFLIRVFLRRDRIAQVGLFDARGQLTASVFVDDPEAFARQEPQFRRHDTVAAGEVEDFQRRASELLSQVPEGRAYAIS
APYLTGVRRRPAVVVAARAPGTRTGGLAAELGLEELSQRLAARGVGDERVFLLDGAGRLLLDGEPERERHEDFTGKLPGA
VGARQTGLAAYEEEGRAWLAAYSPVPELGWVAVVARPREAALAPLHALARSTYGVLGLTLLGVLALALMLARALARPIAR
LAEGARALARGNLAHRISLKRRDELGDLARAFNDMGQALEQAHRELLGFNEQLAAQVEERTRELQQTQVQLSRSQRLAAM
GDLAAGMAHEMNNPLAAVLGNVQLMLMDLPKEDPSHRMLGTVHQQAQRIASIVRELQLLSERQQLGRLPLDLHRMLQRVL
ESRCAELSQVGVHVDCRFHPGEVKVLGDTQALGDVLGRLLGNALNAMRDRPERNLVLSTQVVDAEVVRVEMKDTGRGIAR
EHLERIFNPFFTTKQQWTGKGLSLAVCHRVIEDHGGTITLDSVEGVGTTVTLVLPAAPASSGLV
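Before editing all 52 files in place, it may be worth rehearsing on a scratch copy. The following is only a sketch under assumptions: the scratch directory, the file name sample.fasta, and the shortened header are made up for illustration, and -i.bak is used so that sed keeps a backup of each file it edits.

```shell
# Scratch directory with one made-up FASTA file (names are hypothetical).
tmpdir=$(mktemp -d)
printf '%s\n' '>id_1 [protein=HAMP domain-containing protein] [gbkey=CDS]' \
              'MRIRTRLLLL' > "$tmpdir/sample.fasta"

# Header-only substitution: the /^>/ address skips the sequence lines.
# -i.bak edits in place and keeps the original as sample.fasta.bak.
sed -i.bak -e '/^>/ s/ /_/g' "$tmpdir"/*.fasta

header=$(head -n 1 "$tmpdir/sample.fasta")
printf '%s\n' "$header"
```

If the result looks right, the same command can be pointed at the real files, and the .bak copies removed afterwards.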

Related

Add space within a line

I have many files named a, b, c and so on. These files contain lines like this:
11.077-105.882
-22.134-302.321
-1.011-201.254
I want to add a space wherever a - sign occurs in the middle of a line. I want my output file to look like this:
11.077 -105.882
-22.134 -302.321
-1.011 -201.254
I have tried this command:
cat a |sed 's/-/ -/g' >out.txt
But it does not give the desired result, because it also adds a space before a - at the start of a line.
Require (and capture) a character before each - to replace:
$ sed 's/\(.\)-/\1 -/g' < tmp.txt
11.077 -105.882
-22.134 -302.321
-1.011 -201.254
This will only match a - that is not line-initial, and will include the preceding character in the replacement text.
You could combine 2 sed commands:
$ sed 's/-/ -/g' a | sed 's/^ //'
11.077 -105.882
-22.134 -302.321
-1.011 -201.254
Or, as a single command, add a space only before a - that comes directly after a digit:
$ sed 's,\([0-9]\)-,\1 -,' a
11.077 -105.882
-22.134 -302.321
-1.011 -201.254
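As a small sketch of the digit-anchored variant, the example below adds the g flag so that every mid-line minus sign is handled, not just the first one. The third input line, with two mid-line minus signs, is made up for illustration.

```shell
# Made-up input: the last line has two mid-line minus signs.
input=$(printf '%s\n' '11.077-105.882' '-22.134-302.321' '1.5-2.5-3.5')

# Insert a space before any '-' that directly follows a digit.
# A leading '-' has no digit before it, so it is left alone.
result=$(printf '%s\n' "$input" | sed 's/\([0-9]\)-/\1 -/g')
printf '%s\n' "$result"
```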

String manipulation via script

I am trying to get a substring between &DEST= and the next & or a line break.
For example :
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
In this I need to extract "SFO"
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
In this I need to extract "SANFRANSISCO"
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
In this I need to extract "SANJOSE"
I am reading a file line by line, and I need to update the text after &DEST= and put it back in the file. The modification of the text is to mask the dest value with X character.
So, SFO should be replaced with XXX.
SANJOSE should be replaced with XXXXXXX.
Output :
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
Please let me know how to achieve this in script (Preferably shell or bash script).
Thanks.
$ cat file
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=PORTORICA
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
$ sed -E 's/^.*&DEST=([^&]*)[&]*.*$/\1/' file
SFO
PORTORICA
SANFRANSISCO
SANJOSE
should do it
Replacing airports with an equal number of Xs
Let's consider this test file:
$ cat file
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
To replace the strings after &DEST= with an equal length of X and using GNU sed:
$ sed -E ':a; s/(&DEST=X*)[^X&]/\1X/; ta' file
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
To replace the file in-place:
sed -i -E ':a; s/(&DEST=X*)[^X&]/\1X/; ta' file
The above was tested with GNU sed. For BSD (OSX) sed, try:
sed -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta file
Or, to change in-place with BSD(OSX) sed, try:
sed -i '' -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta file
If there is some reason why it is important to use the shell to read the file line-by-line:
while IFS= read -r line
do
echo "$line" | sed -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta
done <file
How it works
Let's consider this code:
search_str="&DEST="
newfile=chart.txt
sed -E ':a; s/('"$search_str"'X*)[^X&]/\1X/; ta' "$newfile"
-E
This tells sed to use Extended Regular Expressions (ERE). This has the advantage of requiring fewer backslashes to escape things.
:a
This creates a label a.
s/('"$search_str"'X*)[^X&]/\1X/
This looks for $search_str followed by any number of X followed by any character that is not X or &. Because of the parens, everything except that last character is saved into group 1. This string is replaced by group 1, denoted \1 and an X.
ta
In sed, t is a test command. If the substitution was made (meaning that some character needed to be replaced by X), then the test evaluates to true and, in that case, ta tells sed to jump to label a.
This test-and-jump causes the substitution to be repeated as many times as necessary.
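To see the loop at work, the sketch below applies the same single substitution once per pass from a shell loop and prints the line after each pass. It is only an illustration of what :a ... ta does inside sed; the shortened input line is made up.

```shell
line='GETTHIS&DEST=SFO&ORIG=6546'

# Each pass turns one more character of the DEST value into X.
# Stop when a pass changes nothing, which is when sed's t would not jump.
while :; do
    next=$(printf '%s\n' "$line" | sed -E 's/(&DEST=X*)[^X&]/\1X/')
    [ "$next" = "$line" ] && break
    line=$next
    printf '%s\n' "$line"
done
```

The loop prints three intermediate lines, one per masked character, and leaves the fully masked line in the variable line.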
Replacing multiple tags with one sed command
$ name='DEST|ORIG'; sed -E ':a; s/(&('"$name"')=X*)[^X&]/\1X/; ta' file
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=XXXX
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=XXXX
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
Answer for original question
Using shell
$ s='MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546'
$ s=${s#*&DEST=}
$ echo ${s%%&*}
SFO
How it works:
${s#*&DEST=} is prefix removal. This removes all text up to and including the first occurrence of &DEST=.
${s%%&*} is suffix removal. It removes all text from the first & to the end of the string.
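The two expansions can be wrapped in a small function. This is a sketch, and get_dest is a hypothetical helper name, not something from the answer. Note that if a line has no &DEST= at all, the prefix removal leaves the string unchanged, so the function would print whatever precedes the first &.

```shell
# get_dest LINE: print the value of the DEST parameter (hypothetical helper).
get_dest() {
    s=${1#*&DEST=}              # drop everything up to and including the first "&DEST="
    printf '%s\n' "${s%%&*}"    # drop everything from the first "&" onward
}

get_dest 'MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546'     # prints SFO
get_dest 'MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE'       # prints SANJOSE
```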
Using awk
$ echo 'MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546' | awk -F'[=\n]' '$1=="DEST"{print $2}' RS='&'
SFO
How it works:
-F'[=\n]'
This tells awk to treat either an equal sign or a newline as the field separator
$1=="DEST"{print $2}
If the first field is DEST, then print the second field.
RS='&'
This sets the record separator to &.
With GNU bash:
re='(.*&DEST=)([^&]*)(.*)'
while IFS= read -r line; do
[[ $line =~ $re ]] && echo "${BASH_REMATCH[1]}fooooo${BASH_REMATCH[3]}"
done < file
Output:
MYREQUESTISTO8764GETTHIS&DEST=fooooo&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=fooooo&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=fooooo
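Replace the characters between &DEST= and the next & (or end of line) with X's:
awk -F'&DEST=' '{
printf("%s&DEST=", $1);
xlen=index($2,"&");                      # position of the first & after the value, 0 if none
if ( xlen == 0) xlen=length($2)+1;       # no &: the value runs to the end of the line
for (i=1;i<xlen;i++) printf("%s", "X");  # one X per character of the value
endstr=substr($2,xlen);                  # the rest of the line, starting at the &
printf("%s\n", endstr);
}' file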

sed: Replacing a range of text with contents of a file

There are many examples here and elsewhere on the interwebs for using sed's 'r' to replace a pattern, but it does not seem to work on a range, but maybe I'm just not holding it right.
The following works as expected, deleting BEGIN PATTERN and replacing it with the contents of /tmp/somefile.
sed -n "/BEGIN PATTERN/{ r /tmp/somefile d }" TARGET_FILE
This, however, only replaces END PATTERN with the contents of /tmp/somefile.
sed -n "/BEGIN PATTERN/,/END PATTERN/ { r /tmp/somefile d }" TARGET_FILE
I suppose I could try perl or awk to do this as well, but it seems like sed should be able to do this.
I believe that this does what you want:
sed $'/BEGIN PATTERN/r somefile\n /BEGIN PATTERN/,/END PATTERN/d' file
Or:
sed -e '/BEGIN PATTERN/r somefile' -e '/BEGIN PATTERN/,/END PATTERN/d' file
How it works
/BEGIN PATTERN/r somefile
Whenever BEGIN PATTERN is found, this inserts the contents of somefile.
/BEGIN PATTERN/,/END PATTERN/d
Whenever we are in the range from a line matching /BEGIN PATTERN/ to a line matching /END PATTERN/, we delete (d) the contents of the pattern space.
Example
Let's consider these two test files:
$ cat file
prelude
BEGIN PATTERN
middle
END PATTERN
afterthought
and:
$ cat somefile
This is
New.
Our command produces:
$ sed $'/BEGIN PATTERN/r somefile\n /BEGIN PATTERN/,/END PATTERN/d' file
prelude
This is
New.
afterthought
This might work for you (GNU sed):
sed -e '/BEGIN PATTERN/,/END PATTERN/{/END PATTERN/!d;r somefile' -e 'd}' file
John1024's answer works if BEGIN PATTERN and END PATTERN are different. If this is not the case, the following works:
sed $'1,/PATTERN/ { /PATTERN/r somefile\n }; /PATTERN/,/PATTERN/d' file
By preserving the pattern:
sed $'/PATTERN/,/PATTERN/ { /PATTERN/!d; }; 1,/PATTERN/ { /PATTERN/r somefile\n }' file
This solution can yield false positives if the pattern is not paired, as potong pointed out.
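The marker-preserving variant can be tried on scratch files. The sketch below assumes GNU sed and uses made-up file contents. It passes the script as several -e chunks instead of $'...', because the file name given to r must be the last thing on its line.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' prelude PATTERN middle PATTERN afterthought > "$tmpdir/file"
printf '%s\n' 'This is' 'New.' > "$tmpdir/somefile"

# Keep both marker lines, delete what lies between them, and read
# somefile in after the first marker line only.
result=$(sed -e '/PATTERN/,/PATTERN/{/PATTERN/!d;}' \
             -e "1,/PATTERN/{/PATTERN/r $tmpdir/somefile" \
             -e '}' "$tmpdir/file")
printf '%s\n' "$result"
```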

sed right align a group of text

This question originated from string pattern-matching using awk. Basically, we are splitting a line of text into multiple groups based on a regex pattern and then printing only two of the groups. Now the question is: can we right-align a group while printing through sed?
Below is an example:
$ cat input.txt
it is line one
it is longggggggg one
itttttttttt is another one
Now
$ sed -e 's/\(.*\) \(.*\) \(.*\) \(.*\)/\1 \3/g' input.txt
splits each line and prints groups 1 and 3, but the output is
it line
it longggggggg
itttttttttt another
My question is: can we do it through sed so that the output comes out as
it                  line
it           longggggggg
itttttttttt      another
I did it with awk, but I feel it can be done through sed. However, I cannot work out how to get the length of the second group and then pad the correct number of spaces between the groups. I am open to any suggestions to try out.
This might work for you (GNU sed):
sed -r 's/^(.*) .* (.*) .*$/\1 \2/;:a;s/^.{1,40}$/ &/;ta;s/^( *)(\S*)/\2\1/' file
or:
sed -r 's/^(.*) .* (.*) .*$/printf "%-20s%20s" \1 \2/e' file
You can use looping in sed to achieve what you want:
#!/bin/bash
echo 'aa bb cc dd
11 22 33333333 44
ONE TWO THREEEEEEEEE FOUR' | \
sed -e 's/\(.*\) \(.*\) \(.*\) \(.*\)/\1 \3/g' \
-e '/\([^ ]*\) \([^ ]*\)/ { :x ; s/^\(.\{1,19\}\) \(.\{1,19\}\)$/\1  \2/g ; tx }'
The two 19's control the width of your columns. The :x is a label which tx jumps back to whenever the preceding substitution succeeded. Each pass inserts one more space between the two words, until the line is too long for the pattern to match. (You could add a p; before tx to "debug" it.)
awk is the easiest tool in this case.
You could also use a bash loop to calculate the number of spaces and run sed on each line:
width=30 # assumed total line width
while read -r first _ third _; do
# spaces needed so that the third word ends at column $width
SPACES=$(printf '%*s' $(( width - ${#first} - ${#third} )) '')
echo "$first $third" | sed "s/\([^ ]*\) \([^ ]*\)/\1$SPACES\2/"
done < file
But I prefer to use awk for all of that (or another scripting language such as Perl, Python, or PHP in shell mode).
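For comparison, the awk route mentioned here might look like the following sketch. The column widths of 12 are arbitrary assumptions, and the input is the sample from the question.

```shell
# Print fields 1 and 3: "%-12s" left-aligns the first in 12 columns,
# "%12s" right-aligns the second in 12 columns (both widths are arbitrary).
result=$(printf '%s\n' 'it is line one' \
                       'it is longggggggg one' \
                       'itttttttttt is another one' |
         awk '{ printf "%-12s%12s\n", $1, $3 }')
printf '%s\n' "$result"
```

Because printf does the padding, no loop or length arithmetic is needed; a word longer than its column simply pushes the line wider.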
# a run of spaces as long as the desired line width (30 here, as an example)
TemplateSpace="                              "
TemplateSize=${#TemplateSpace}
sed "
# split into groups (based on words here, but this depends on your real need)
s/^ *\(\w\+\) \(\w\+\) \(\w\+\) \(\w\+\).*$/\1 \3/
# align: pad to the template width, then move the padding between the two words
s/$/${TemplateSpace}/
s/^\(.\{${TemplateSize}\}\).*$/\1/
s/^\(\w\+\) \(\w\+\)\( *\)$/\1 \3\2/
"
or, more simply, to avoid TemplateSize (provided there are no dots in the content):
TemplateSpace="............................................................."
and replace
s/^\(.\{${TemplateSize}\}\).*$/\1/
by
s/^\(${TemplateSpace}\).*$/\1/
s/\./ /g
Delete columns 2 and 4. Right-justify the resulting column 2 at a line length of 23 characters.
sed -e '
s/[^ ]\+/                       /4;
s/[^ ]\+//2;
s/^\(.\{23\}\).*$/\1/;
s/\(^[^ ]\+[ ]\+\)\([^ ]\+\)\([ ]\+\)/\1\3\2/;
'
or GNU sed with extended regex:
sed -r '
s/\W+\w+\W+(\w+)\W+\w+$/ \1                       /;
s/^(.{23}).*/\1/;
s/(\W+)(\w+)(\W+)$/\1\3\2/
'
This question is old, but I like to see it as a puzzle.
While I love the loop solution for its brevity, here is one without a loop or shell help. The run of spaces after the # sets the total line width (30 here, as an example).
sed -E "s/ \w+ (\w+) \w+$/ \1/;h;s/./ /g;s/$/#                              /;s/( *)#\1//;x;H;x;s/\n//;s/^( *)(\w+)/\2\1/"
or without extended regex
sed "s/ .* \(.*\) .*$/ \1/;h;s/./ /g;s/$/#                              /;s/\( *\)#\1//;x;H;x;s/\n//;s/^\( *\)\([^ ]*\)/\2\1/"

Why is my file filled with extbar after running sed?

Based on the information at https://tex.stackexchange.com/questions/48933/which-symbols-need-to-be-escaped-in-context, I want to prepare a file for use with ConTeXt. I need to make several replacements:
Replace # with \#.
Replace % with \percent.
Replace | with \textbar.
Replace $ with \textdollar.
Replace _ with \textunderscore.
Replace ~ with \textasciitilde.
Replace { with \textbraceleft.
Replace } with \textbraceright.
I have tried using the information from Replacing "#", "$", "%", "&", and "_" with "\#", "\$", "\%", "\&", and "\_" to do these replacements:
sed -i 's/\&/\\\&/g' ./File.csv
sed -i 's/\#/\\\#/g' ./File.csv
sed -i 's/\%/\\\percent/g' ./File.csv
sed -i 's/\|/\\\textbar/g' ./File.csv
sed -i 's/\$/\\\textdollar/g' ./File.csv
sed -i 's/\_/\\\textunderscore/g' ./File.csv
sed -i 's/\~/\\\textasciitilde/g' ./File.csv
sed -i 's/\{/\\\textbraceleft/g' ./File.csv
sed -i 's/\}/\\\textbraceright/g' ./File.csv
Unfortunately, when I run these scripts, the entire file is changed to a bunch of strange letters, numbers, and the words "extbar" everywhere.
How can I make these replacements?
Why is "extbar" appearing in my file after running these commands?
When you run
sed -i 's/|/\\\textbar/g' ./File.csv
sed reads the replacement \\\textbar as follows: \\ becomes a literal \ and \t becomes a tab character, which leaves the text extbar.
Try
sed -i "s/|/\\\textbar/g"
or
sed -i 's/|/\\textbar/g'
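Count the backslashes by how often they are evaluated. Inside double quotes they are evaluated twice, once by the shell and once by sed, so four are needed there; inside single quotes only sed evaluates them, so two are enough. With three backslashes in single quotes, sed sees an escaped backslash followed by \t, so the replacement is a backslash, a tab character, and then the string 'extbar' (from \textbar).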
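The two quoting levels can be compared side by side. This is a minimal sketch with a made-up input string; both commands are meant to produce the same replacement.

```shell
# Single quotes: sed receives exactly what is typed.
# Two backslashes in the replacement yield one literal backslash.
single=$(printf '%s\n' 'a | b' | sed 's/|/\\textbar/g')

# Double quotes: the shell halves the backslashes first, so four are needed.
double=$(printf '%s\n' 'a | b' | sed "s/|/\\\\textbar/g")

printf '%s\n' "$single" "$double"    # both lines read: a \textbar b
```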
This might work for you:
cat <<\! >Village.sed
s/&/\\&/g
s/#/\\#/g
s/%/\\percent/g
s/|/\\textbar/g
s/\$/\\textdollar/g
s/_/\\textunderscore/g
s/~/\\textasciitilde/g
s/{/\\textbraceleft/g
s/}/\\textbraceright/g
!
sed -f Village.sed ./File.csv
I am not sure why "extbar" is appearing in your file; it probably has to do with the line s/\|/\\\textbar/g, where \| means alternation (in GNU sed).
See here:
echo foo | sed 's/\|/\\bar/'
\barfoo
echo foo | sed 's/|/\\bar/'
foo
