Remove a pattern from each line of a FASTA file in the terminal

I have a fasta file, file.fasta, that has the following patterns:
>firstnumber 01abc_numericsequence
CGTAATCG
>secondnumber 01abc_anothernumericsequence
GGTAAACC
and so on, but I'd like the output to be something like:
>firstnumber
CGTAATCG
>secondnumber
GGTAAACC
How can I delete the pattern 01abc and everything that goes after it in each line, and overwrite the file.fasta?
Please, can anyone provide a solution?

cat fasta
>firstnumber 01abc_numericsequence
CGTAATCG
>secondnumber 01abc_anothernumericsequence
GGTAAACC
awk '/^>/ {$0=$1} 1' fasta
>firstnumber
CGTAATCG
>secondnumber
GGTAAACC
sed '/^>/ s/ .*//' fasta
>firstnumber
CGTAATCG
>secondnumber
GGTAAACC
Both the sed and awk commands delete everything from the first space (inclusive) onward on every line that starts with >.
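Neither command above writes back to the file, which the question also asks for. A minimal sketch of one portable way to do that, assuming a throwaway copy of the data in a temporary directory (the tmpdir and fasta names are only for this demo): write the result to a temporary file and move it over the original only if sed succeeded.

```shell
# Demo data in a scratch directory; with real data, point "fasta" at file.fasta.
tmpdir=$(mktemp -d)
fasta="$tmpdir/file.fasta"
printf '%s\n' \
  '>firstnumber 01abc_numericsequence' \
  'CGTAATCG' \
  '>secondnumber 01abc_anothernumericsequence' \
  'GGTAAACC' > "$fasta"

# Strip everything from the first space on header lines,
# then replace the original only if sed exited successfully.
sed '/^>/ s/ .*//' "$fasta" > "$fasta.tmp" && mv "$fasta.tmp" "$fasta"

result=$(cat "$fasta")
printf '%s\n' "$result"
```

With GNU sed, sed -i '/^>/ s/ .*//' file.fasta should have the same effect in one step.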

I've tried
sed 's/01abc*//' file.fasta
The problem is that it removed the pattern but left both _numericsequence and _anothernumericsequence in place. Also, the changes were not saved in file.fasta.
Then, I tried
ex -sc '%s/\(\01abc\).*/\1/ | x' file.fasta
And it removed both _numericsequence and _anothernumericsequence. The problem is that I want to remove the pattern too, and it didn't.
Finally, I've tried
ex -sc '%s/\(\ \).*/\1/ | x' file.fasta
And it worked, because the other lines don't have any spaces in this case.

Related

I need to delete two " " with the sed command

I need to delete the " characters in a file:
"CITFFUSKD-E0"
I have tried sed 's/\"//'.
The result is:
CITFFUSKD-E0"
How can I delete both?
I also need to delete everything after the first word, but the input can be any of these:
"CITFFUSKD-E0"
"CITFFUSKD_E0"
"CITFFUSKD E0"
The result I want:
CITFFUSKD
You may use
sed 's/"//g' file | sed 's/[^[:alnum:]].*//' > newfile
Or, contract the two sed commands into one sed call as @Wiimm suggests:
sed 's/"//g;s/[^[:alnum:]].*//' file > newfile
If you want to replace inline, see sed edit file in place.
Explanation:
sed 's/"//g' file - removes all " chars from the file
sed 's/[^[:alnum:]].*//' > newfile - also removes all chars from a line starting from the first non-alphanumeric char and saves the result into a newfile.
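As a quick check, here is the combined command run on the three sample lines from the question. This is only a sketch: the input is piped in with printf instead of being read from a file.

```shell
# Feed the three variants from the question through the one-call version.
result=$(printf '%s\n' '"CITFFUSKD-E0"' '"CITFFUSKD_E0"' '"CITFFUSKD E0"' |
  sed 's/"//g;s/[^[:alnum:]].*//')
printf '%s\n' "$result"
```

All three lines should come out as CITFFUSKD, because -, _ and the space are all non-alphanumeric.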
Could you please try following.
awk 'match($0,/[a-zA-Z]+[^a-zA-Z]*/){val=substr($0,RSTART,RLENGTH);gsub(/[^a-zA-Z]+/,"",val);print val}' Input_file
delete everything behind first word
sed 's/^"\([[:alpha:]]*\)[^[:alpha:]]*.*/\1/'
Match the first ". Then match a sequence of alphabetic characters and capture it. Then match any non-alphabetic characters, [^[:alpha:]]*, followed by the rest of the line. Substitute it all for \1, which is a backreference to the part inside \( ... \), i.e. the first word.
I need delete two “ ” with sed command
Remove all possible ":
sed 's/"//g'
Extract the string between ":
sed 's/"\([^"]*\)"/\1/'
Remove everything except alphanumeric characters (numbers + a-z + A-Z, i.e. [0-9a-zA-Z]):
sed 's/[^[:alnum:]]//g'
This should do all in one go, remove the ", print the first part:
awk -F\" '{split($2,a,"-| |_");print a[1]}' file
CITFFUSKD
CITFFUSKD
CITFFUSKD
When you have 1 line, you can use
grep -Eo "(\w)*" file | head -1
For normal files (starting with a double quote on each line), try this
tr -c '[:alnum:]\n' '"' < file | cut -d'"' -f2
Many legitimate ways to solve this.
I favor using what you know about your data to simplify solutions -- this is usually an option. If everything in your file follows the same pattern, you can simply extract the first set of capitalized letters encountered:
sed 's/"\([A-Z]\+\).*$/\1/' file
awk '{gsub(/^.|....$/,"")}NR==1' file
CITFFUSKD

Sed doubts when n occurrences are used

I'm trying to replace the nth occurrence of a substring in a file. I tried to achieve this using sed but all attempts failed to give me the desired output. Some of the attempts are:
sed 's/old/new/g'
sed 's/old/new/3'
sed 's/old/new/3g'
The most common usage of sed is to perform a replacement such as
sed 's/foo/bar/' file
This will replace the first occurrence of the string foo by the string bar and it will do this for every line in file.
If you want to replace the 3rd occurrence of the string foo only, but do this for every line, then you can write:
sed 's/foo/bar/3' file
Finally, if you want to replace all occurrences, then you use:
sed 's/foo/bar/g' file
Any combination such as
sed 's/foo/bar/3g' file
results in behaviour that POSIX leaves unspecified. GNU sed replaces the 3rd match and every match after it on each line.
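To see the difference between the flags, here is a small sketch that runs each well-defined variant on one made-up line containing four occurrences of foo.

```shell
line='foo foo foo foo'

# No flag: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/foo/bar/')
# Numeric flag: only the 3rd match on the line is replaced.
third=$(printf '%s\n' "$line" | sed 's/foo/bar/3')
# g flag: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/foo/bar/g')

printf '%s\n' "$first" "$third" "$all"
# bar foo foo foo
# foo foo bar foo
# bar bar bar bar
```

Note that the count restarts on every line, which is why none of these can target the nth occurrence in the whole file.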
If you want to replace the nth occurrence in a whole file, then sed is not the right tool; perl or awk might be better.
If you know you have at most one occurrence of "foo" per line, you can do (n is the occurrence to replace, here 3)
awk -v n=3 '/foo/{c++} c==n{sub("foo","bar")}1' file
If more than a single occurrence per line might appear it becomes a bit more tricky, various solutions are possible:
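awk 'BEGIN{FS="foo";OFS="bar";n=5}
(c<n) && (c+NF-1>=n) {
    # the nth occurrence is on this line: rejoin the fields, using OFS exactly once
    for(i=1;i<NF;++i) printf "%s%s", $i, ((++c==n) ? OFS : FS)
    print $NF; next
}
# NF-1 separators were seen on this line; skip empty lines, where NF is 0
{if (NF) c+=NF-1; print}' file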
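A small worked run of the field-splitting idea, as a sketch under a few assumptions: the three input lines are made up, n is passed with -v instead of being set in BEGIN, printf gets an explicit format so that a % in the data is harmless, and empty lines are skipped when counting.

```shell
# The 3rd "foo" in the whole input sits on the second line.
result=$(printf '%s\n' 'foo a foo' 'b foo' 'foo foo c' |
  awk -v n=3 'BEGIN { FS = "foo"; OFS = "bar" }
  (c < n) && (c + NF - 1 >= n) {
      # This line contains the nth occurrence: print the fields back,
      # joined by FS, except at the nth separator, where OFS is used.
      for (i = 1; i < NF; ++i)
          printf "%s%s", $i, ((++c == n) ? OFS : FS)
      print $NF
      next
  }
  # Otherwise just count the separators seen on this line.
  { if (NF) c += NF - 1; print }')
printf '%s\n' "$result"
# foo a foo
# b bar
# foo foo c
```

Because FS is treated as a regular expression here, a search string containing regex metacharacters would need escaping.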

Bash script delete a line in the file

I have a file, which has multiple lines.
For example:
a
ab#
ad.
a12fs
b
c
...
I want to use sed or awk to delete a line if it includes symbols or numbers. (For example, I want to delete the lines ab#, ad., a12fs, ...)
In other words, I just want to keep the lines that contain only [a-z][A-Z].
I know how to delete lines containing numbers,
sed '/[0-9]/d' file.txt
but I do not know how to delete lines containing symbols.
Is there an easy way to do that?
To keep blank lines:
grep '^[[:alpha:]]*$' file
sed '/[^[:alpha:]]/d' file
awk '/^[[:alpha:]]*$/' file
To remove blank lines:
grep -E '^[[:alpha:]]+$' file
sed -E -n '/^[[:alpha:]]+$/p' file
awk '/^[[:alpha:]]+$/' file
grep works well too and is even simpler: do the reverse and keep the lines that interest you, which are much easier to define:
grep -i '^[a-z]*$' file.txt
(this matches lines containing only letters, plus empty lines; the -i option makes grep case-insensitive)
To remove empty lines as well:
grep -Ei '^[a-z]+$' file.txt
Be careful with Windows text files: there is a carriage return at the end of each line, so depending on the grep version nothing might match (tested on Windows here and it works).
But just in case:
grep -iP '^[a-z]*\r?$' file.txt
(note the -P option to enable Perl regular expressions; without it, \r is not recognized)
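If -P is not available (it is a GNU grep extension), another option is to strip the carriage returns with tr before filtering. A minimal sketch, assuming CRLF input that mirrors the sample from the question:

```shell
# Simulated Windows file: every line ends in \r\n.
# tr removes the carriage returns, then grep keeps letters-only lines.
result=$(printf 'a\r\nab#\r\nad.\r\na12fs\r\nb\r\nc\r\n' |
  tr -d '\r' |
  grep -E '^[[:alpha:]]+$')
printf '%s\n' "$result"
# a
# b
# c
```

Unlike the \r?$ pattern, this also converts the kept lines to Unix line endings, which may or may not be what you want.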
You can use this sed (letters only, since lines containing digits should be deleted too):
sed '/^[A-Za-z]\+$/!d' file
(OR)
sed '/[^A-Za-z]/d' file
$ awk '!/[^[:alpha:]]/' file.txt
a
b
c

Display all fields except the last

I have a file as shown below
1.2.3.4.ask
sanma.nam.sam
c.d.b.test
I want to remove the last field from each line. The delimiter is . and the number of fields is not constant.
Can anybody help me find a solution with awk or sed? I can't use perl here.
Both these sed and awk solutions work independent of the number of fields.
Using sed:
$ sed -r 's/(.*)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
Note: -r is the flag for extended regexps; in some versions it is -E, so check with man sed. If your version of sed doesn't have a flag for this, then just escape the parentheses:
sed 's/\(.*\)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
The sed solution does a greedy match up to the last . and captures everything before it, then replaces the whole line with only the captured part (n-1 fields). Use the -i option if you want the changes to be stored back to the file.
Using awk:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file
1.2.3.4
sanma.nam
c.d.b
The awk solution simply prints n-1 fields. To store the changes back to the file, use redirection:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file > tmp && mv tmp file
Reverse, cut, reverse back.
rev file | cut -d. -f2- | rev >newfile
Or, replace from last dot to end with nothing:
sed 's/\.[^.]*$//' file >newfile
The regex [^.] matches one character which is not dot (or newline). You need to exclude the dot because the repetition operator * is "greedy"; it will select the leftmost, longest possible match.
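Here is that sed command run on the sample data, as a sketch. The extra line nodots is a made-up edge case, added to show that a line without any dot passes through unchanged.

```shell
# Remove the last dot and everything after it on each line.
result=$(printf '%s\n' '1.2.3.4.ask' 'sanma.nam.sam' 'c.d.b.test' 'nodots' |
  sed 's/\.[^.]*$//')
printf '%s\n' "$result"
# 1.2.3.4
# sanma.nam
# c.d.b
# nodots
```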
With cut on the reversed string
cat youFile | rev |cut -d "." -f 2- | rev
If you want to keep the trailing ".", use the following:
awk '{gsub(/[^\.]*$/,"");print}' your_file

Delete all lines beginning with a # from a file

All of the lines with comments in a file begin with #. How can I delete all of the lines (and only those lines) which begin with #? Other lines containing #, but not at the beginning of the line should be ignored.
This can be done with a sed one-liner:
sed '/^#/d'
This says, "find all lines that start with # and delete them, leaving everything else."
I'm a little surprised nobody has suggested the most obvious solution:
grep -v '^#' filename
This solves the problem as stated.
But note that a common convention is for everything from a # to the end of a line to be treated as a comment:
sed 's/#.*$//' filename
though that treats, for example, a # character within a string literal as the beginning of a comment (which may or may not be relevant for your case) (and it leaves empty lines).
A line starting with arbitrary whitespace followed by # might also be treated as a comment:
grep -v '^ *#' filename
if whitespace is only spaces, or
grep -v '^[  ]*#' filename
where the two spaces are actually a space followed by a literal tab character (type "control-v tab").
For all these commands, omit the filename argument to read from standard input (e.g., as part of a pipe).
The opposite of Raymond's solution:
sed -n '/^#/!p'
"don't print anything, except for lines that DON'T start with #"
You can edit your file in place with
sed -i '/^#/ d'
If you also want to delete comment lines that start with some whitespace, use
sed -i '/^\s*#/ d'
Usually you want to keep the first line of your script if it is a shebang, so sed should not delete lines starting with #!. It should also delete lines that contain only a hash and no text. Putting it all together:
sed -i '/^\s*\(#[^!].*\|#$\)/d'
To be compatible with all sed variants, you need to add a backup extension to the -i option:
sed -i.bak '/^\s*#/ d' "$file"
rm -f "$file.bak"
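If you would rather not rely on \s, \| or -i, which are not available in every sed, here is a sketch of the same idea with awk and a temporary file. It swaps in a different technique and makes a few assumptions: the script path is a throwaway demo file, and #! is only treated as special on the very first line.

```shell
tmpdir=$(mktemp -d)
script="$tmpdir/demo.sh"
printf '%s\n' '#!/bin/sh' '# comment' 'echo hi' \
  '   # indented comment' '#' 'echo bye # trailing' > "$script"

# Rule 1 keeps a shebang on line 1; rule 2 prints every other line
# unless it starts with optional spaces/tabs followed by '#'.
awk 'NR == 1 && /^#!/ { print; next } !/^[ \t]*#/' "$script" > "$script.tmp" &&
  cat "$script.tmp" > "$script" && rm -f "$script.tmp"

result=$(cat "$script")
printf '%s\n' "$result"
# #!/bin/sh
# echo hi
# echo bye # trailing
```

Writing back with cat instead of mv keeps the original file's permissions, which matters for executable scripts.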
You can use the following for an awk solution, which prints only the lines that do not start with # -
awk '!/^#/' inputfile
This answer builds upon the earlier answer by Keith.
egrep -v "^[[:blank:]]*#" should filter out comment lines.
egrep -v "^[[:blank:]]*(#|$)" should filter out both comments and empty lines, as is frequently useful.
For information about [:blank:] and other character classes, refer to https://en.wikipedia.org/wiki/Regular_expression#Character_classes.
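A quick sketch of the second filter on some made-up input. It uses grep -E, which is equivalent to egrep (recent GNU grep releases print a warning when egrep is invoked).

```shell
# Input: a comment, a tab-indented comment, an empty line,
# a line with a trailing comment, a spaces-only line, and plain text.
result=$(printf '# comment\n\t# tabbed comment\n\nvalue=1 # trailing\n   \nx\n' |
  grep -Ev '^[[:blank:]]*(#|$)')
printf '%s\n' "$result"
# value=1 # trailing
# x
```

The line with the trailing comment is kept, since only lines that start with optional blanks followed by # (or by nothing at all) are filtered out.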
If you want to delete from the file starting with a specific word, then do this:
grep -v '^pattern' currentFileName > newFileName && mv newFileName currentFileName
So we remove all the lines starting with the pattern, write the remaining content into a new file, and then move the new file over the source/current file.
You might also want to remove empty lines:
sed -E '/(^$|^#)/d' inputfile
Delete all empty lines and also all lines starting with a # after any spaces:
sed -E '/^$|^\s*#/d' inputfile
For example, the following 3 lines would be deleted (the leading numbers are only line numbers, not part of the file):
1. # first comment
2.
3. # second comment
After testing the command above, you can use option -i to edit the input file in place.
Just this!
Here it is with a loop over all files with some extension:
for i in *.filename_extension # a glob avoids parsing ls output
do
echo "$i"
sed -i '/^#/d' "$i"
done

Resources