Removing text from a fasta gene name between two characters - bioinformatics

I have a large codon alignment that has a variety of gene names in the headers. The headers are in the following format:
>ENST00000357033.DMD.-1 | CODON | REFERENC
I want to modify all of the headers in the fasta to exclude all characters after the first "." and before the first "|". Desired outcome:
>ENST00000357033 | CODON | REFERENC
I've tried a few sed commands, no dice. Any advice? I'm averse to using awk, since I'd like to keep the formatting of the alignment and awk scares me.
Thank you!

sed '/^>/s/\.[^ ]* / /'
On each line starting with '>', this replaces the first dot, any non-space characters that follow it, and the next space with a single space.
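A quick way to check the sed command before running it on the real alignment is to feed it a tiny sample. This is only a sketch: the function name strip_gene_name and the sequence line are made up for illustration, and it assumes sequence lines never start with '>'.

```shell
#!/bin/sh
# strip_gene_name: hypothetical wrapper around the sed command above.
# Only header lines (starting with '>') are edited; everything else,
# including sequence lines that happen to contain a dot, passes through.
strip_gene_name() {
    sed '/^>/s/\.[^ ]* / /'
}

# Made-up sample: one header line and one sequence line.
printf '%s\n' \
    '>ENST00000357033.DMD.-1 | CODON | REFERENC' \
    'ATG---CTT.TGG' |
    strip_gene_name
```

To write the result to a new file, redirect the output, for example strip_gene_name < in.fasta > out.fasta (both file names are placeholders).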

No need to be scared of awk:
mawk NF=NF FS='[.][^ ]+' OFS=
>ENST00000357033 | CODON | REFERENC

Related

How to extract match in a proper way using grep or another command?

I have been looking into related questions on SO for days but I couldn't find an answer for my use case. I want to grep a word after a match, but without special characters such as quotes, commas, and dashes. I have the following line in my file.
...
"versionName": "1.1.5000-internal",
...
I want to extract the version itself without anything else, not even the -internal suffix, but I have failed so far. The desired output would be as follows:
1.1.5000
What I've tried so far:
1. I tried below to extract version name without surrounding quotes but it failed.
grep -oP '(?<="versionName": )[^"]+(?=")' file.json
2. This one gives the match with all specials.
grep -oP '(?<="versionName": ).*' file.json | sed 's/^.*: //'
3. This one works the same as the second one.
grep '"versionName": ' file.json | sed 's/^.*: //'
I know there should be shorter and simpler approach but I am too bad at such stuff. Thanks in advance!
Using jq:
$ jq -r '.versionName|sub("-.*";"")' file.json
Output:
1.1.5000
Why don't you use jq (as you're clearly dealing with a JSON file)?
You can do:
jq -r .versionName file.json
You can use
grep -oP '"versionName":\s*"\K[0-9.]+(?=[^"]*")' file.json
# Or, if there is no need to check for the trailing "
grep -oP '"versionName":\s*"\K[0-9.]+' file.json
Details:
"versionName": - a literal, fixed string
\s* - zero or more whitespaces
" - a double quotation mark
\K - match reset operator that discards all text matched so far from the match memory buffer
[0-9.]+ - one or more digits/dots
(?=[^"]*") - a positive lookahead that matches a location that is immediately followed with zero or more chars other than " and then a " char.
If you need the first occurrence only, add | head -1 after the grep command.
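To see the pattern and the head -1 tip in action, here is a small sketch. It assumes GNU grep built with PCRE support (the -P option); the sample JSON with a second, nested versionName is made up for illustration.

```shell
#!/bin/sh
# Assumption: GNU grep with -P (PCRE) support is available.
# Made-up sample containing two versionName entries.
sample='{
  "versionName": "1.1.5000-internal",
  "nested": { "versionName":"2.0.3" }
}'

# All matches, one per line: \K drops the key and opening quote from the match.
printf '%s\n' "$sample" | grep -oP '"versionName":\s*"\K[0-9.]+(?=[^"]*")'

# First match only.
first_version=$(printf '%s\n' "$sample" |
    grep -oP '"versionName":\s*"\K[0-9.]+(?=[^"]*")' | head -1)
echo "first: $first_version"
```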
You can correct your first attempt by including the first double quote in the lookbehind and matching all subsequent characters that are neither double quotes nor dashes:
grep -oP '(?<="versionName": ")[^"-]+' file.json
You could use grep -oP and match the opening double quote. Then use \K to forget what is matched until now and then match the version number.
"versionName": "\K\d+(?:\.\d+)*(?=[^"]*")
Explanation
"versionName": " Match literally, including the opening "
\K Forget what is matched so far
\d+(?:\.\d+)* Match 1+ digits with optional parts starting with a dot and 1+ digits
(?=[^"]*") Assert a " to the right
For example
grep -oP '"versionName": "\K\d+(?:\.\d+)*(?=[^"]*")' file.json
Output
1.1.5000
Or use sed: match the whole line, capture the version number in group 1, and use \1 as the replacement. The trailing t;d keeps lines where the substitution succeeded and deletes all others.
sed -E 's/.*"versionName": "([0-9]+(\.[0-9]+)*)[^"]*".*/\1/gm;t;d' file.json
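If the m flag or the t;d idiom is not available in your sed, a variant with -n and the p flag should behave the same way. This is a sketch rather than the answerer's own command; the helper name extract_version and the sample input are made up.

```shell
#!/bin/sh
# extract_version: hypothetical helper. -n suppresses the default output and
# the trailing p prints only lines where the substitution succeeded.
extract_version() {
    sed -nE 's/.*"versionName": "([0-9]+(\.[0-9]+)*)[^"]*".*/\1/p'
}

# Made-up sample file contents.
printf '%s\n' \
    '{' \
    '  "versionName": "1.1.5000-internal",' \
    '  "other": "x"' \
    '}' |
    extract_version
```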
A way with awk could be:
echo '"versionName": "1.1.5000-internal"' | awk -F'"|-' '{print $4}'
1.1.5000
This sets FS to the regex "|- (a double quote or a dash) and prints the corresponding field, $4 in this case.
Or, if you want to check the key first:
echo '"versionName": "1.1.5000-internal"' | awk -F'"|-' '$2=="versionName"{print $4}'
1.1.5000

Delete words in a line using grep or sed

I want to delete three words with a special character on a line such as
Input:
\cf4 \cb6 1749,1789 \cb3 \
Output:
1749,1789
I have tried a couple sed and grep statements but so far none have worked, mainly due to the character \.
My unsuccessful attempt:
sed -i 's/ [.\c ] //g' inputfile.ext >output file.ext
Awk accepts a regex Field Separator (in this case, comma or space):
$ awk -F'[ ,]' '$0 = $3 "." $4' <<< '\cf4 \cb6 1749,1789 \cb3 \'
1749.1789
-F'[ ,]' - Use a single character from the set space/comma as Field Separator
$0 = $3 "." $4 - Set the entire line $0 to Field 3 ($3) followed by a literal period "." followed by Field 4 ($4); if the result is truthy (non-empty and not zero), do the default behavior (print the entire line)
Replace <<< 'input' with file if every line of that file has the same delimiters (spaces/comma) and number of fields. If your input file is more complex than the sample you shared, please edit your question to show actual input.
The backslash is a special meta-character that confuses bash.
We treat it like any other meta-character, by escaping it, with--you guessed it--a backslash!
But first, we need to grep this pattern out of our file
grep -E '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file # Close enough! -E is needed so that + means "one or more"
Now, just sed out those pesky backslashes
| sed -e 's/\\//g' # Don't forget the g, otherwise it'll only strip out 1 backslash
Now, finally, sed out the clusters of 2 alpha followed by a number and a space!
| sed -e 's/[a-z][a-z][0-9] //g'
And, finally....
grep -E '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file | sed -e 's/\\//g' | sed -e 's/[a-z][a-z][0-9] //g'
Output:
1749,1789
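The steps can be tried on the sample line without creating a file. A minimal sketch: the function name extract_number is made up, grep is run with -E so that + means "one or more", and the last sed step is an addition that trims the trailing space left behind after the codes are removed.

```shell
#!/bin/sh
# Sample line from the question; single quotes keep the backslashes literal.
line='\cf4 \cb6 1749,1789 \cb3 \'

# extract_number: hypothetical name for the pipeline described above.
extract_number() {
    grep -E '\\... \\... [0-9]+,[0-9]+ \\... \\' |  # keep only lines with the expected shape
        sed -e 's/\\//g' |                          # drop every backslash
        sed -e 's/[a-z][a-z][0-9] //g' |            # drop "cf4 ", "cb6 ", "cb3 "
        sed -e 's/ *$//'                            # added step: trim trailing spaces
}

printf '%s\n' "$line" | extract_number
```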
My guess is you are having trouble because you have backslashes in input and can't figure out how to get backslashes into your regex. Since backslashes are escape characters to shell and regex you end up having to type four backslashes to get one into your regex.
Ben Van Camp already posted an answer that uses single quotes to make the escaping a little easier; however I shall now post an answer that simply avoids the problem altogether.
grep -o '[0-9]*,[0-9]*' | tr , .
Locks on to the comma and selects the digits on either side and outputs the number. Alternately if comma is not guaranteed we can do it this way:
grep -Eo '(^| )[0-9,]+' | tr , . | tr -d ' '
Both of these assume there's only one usable number per line.
$ awk '{sub(/,/,".",$3); print $3}' file
1749.1789
$ sed 's/\([^ ]* \)\{2\}\([^ ]*\).*/\2/; s/,/./' file
1749.1789

How to extract specific string in a file using awk, sed or other methods in bash?

I have a file with the following text (multiple lines with different values):
TokenRange(start_token:8050285221437500528,end_token:8051783269940793406,...
I want to extract the value of start_token and end_token. I tried awk and cut, but I am not able to figure out the best way to extract the targeted values.
Something like:
cat filename| get the values of start_token and end_token
grep -oP '(?<=token:)\d+' filename
Explanation:
-o: print only part that matches, not complete line
-P: use Perl regex engine (for look-around)
(?<=token:): positive look-behind – zero-width pattern
\d+: one or more digits
Result:
8050285221437500528
8051783269940793406
A (potentially more efficient) variant of this, as pointed out by hek2mgl in his comment, uses \K, the variable-width look-behind:
grep -oP 'token:\K\d+'
\K keeps everything that has been matched to the left of it, but does not include it in the match (see perlre).
Using awk:
awk -F '[(:,]' '{print $3, $5}' file
8050285221437500528 8051783269940793406
First value is start_token and last value is end_token.
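To make the field numbering explicit, here is a sketch of the same awk split with a guard on the key names. The function name token_values and the second sample line are made up; the trailing "..." stands in for the rest of each record, as in the question.

```shell
#!/bin/sh
# token_values: hypothetical helper around the awk command above.
token_values() {
    awk -F '[(:,]' '
        # With "(", ":" and "," as separators the fields are:
        # $1=TokenRange  $2=start_token  $3=<value>  $4=end_token  $5=<value>
        $2 == "start_token" && $4 == "end_token" { print $3, $5 }
    '
}

printf '%s\n' \
    'TokenRange(start_token:8050285221437500528,end_token:8051783269940793406,...' \
    'TokenRange(start_token:1,end_token:2,...' |
    token_values
```

The values are printed as the original field text, so large token numbers are not reformatted.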
a sed version
sed -e '/^TokenRange(/!d' -e 's/.*:\([0-9]*\),.*:\([0-9]*\),.*/\1 \2/' YourFile

bash, text file remove all text in each line before the last space

I have a file with a format like this:
First Last UID
First Middle Last UID
Basically, some names have middle names (and sometimes more than one middle name). I just want a file that only has UIDs.
Is there a sed or awk command I can run that removes everything before the last space?
awk
Print the last field of each line using awk.
The last field is indexed using the NF variable which contains the number of fields for each line. We index it using a dollar sign, the resulting one-liner is easy.
awk '{ print $NF }' file
rs, cat & tail
Another way is to transpose the content of the file, then grab the last line and transpose again (this is fairly easy to see).
The resulting pipe is:
cat file | rs -T | tail -n1 | rs -T
cut & rev
Using cut and rev we could also achieve this goal by reversing the lines, cutting the first field and then reverse it again.
rev file | cut -d ' ' -f1 | rev
sed
Using sed we simply remove all chars until a space is found with the regex ^.* [^ ]*$. This regex means match the beginning of the line ^, followed by any sequence of chars .* and a space . The rest is a sequence of non spaces [^ ]* until the end of the line $. The sed one-liner is:
sed 's/^.* \([^ ]*\)$/\1/' file
Where we capture the last part (in between \( and \)) and sub it back in for the entire line. \1 means the first group caught, which is the last field.
Notes
As Ed Norton cleverly pointed out we could simply not catch the group and remove the former part of the regex. This can be as easily achieved as
sed 's/.* //' file
Which is remarkably less complicated and more elegant.
For more information see man sed and man awk.
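One difference between the awk and sed one-liners is how they treat a trailing space. The sketch below uses made-up names and sample data; the third sample line deliberately ends with a space.

```shell
#!/bin/sh
# Hypothetical wrappers around the two one-liners above.
last_field_awk() { awk '{ print $NF }'; }
last_field_sed() { sed 's/.* //'; }

# Made-up sample; the third line ends with a space.
sample=$(printf '%s\n' 'Ann Lee 1001' 'Bob Ray Fox 1002' 'Cy Dee 1003 ')

echo 'awk:'
printf '%s\n' "$sample" | last_field_awk   # awk ignores the trailing space
echo 'sed:'
printf '%s\n' "$sample" | last_field_sed   # sed leaves the third line empty
```

If the input may contain trailing whitespace, the awk version is the safer choice, or the whitespace can be stripped first.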
Using grep:
$ grep -o '[^[:blank:]]*$' file
UID
UID
-o tells grep to print only the matching part. The regex [^[:blank:]]*$ matches the last word on the line.

Remove all characters existing between first n occurrences of a specific character in each line in shell

Say I have txt file with characters as follows:
abcd|123|kds|Name|Place|Phone
ldkdsd|323|jkds|Name1|Place1|Phone1
I want to remove all the characters in each line up to and including the third occurrence of the | character. I want my output as:
Name|Place|Phone
Name1|Place1|Phone1
Could anyone help me figure this out? How can I achieve this using sed?
This would be a typical task for cut
cut -d'|' -f4- file
output:
Name|Place|Phone
Name1|Place1|Phone1
The -f4- means you want everything from the fourth field to the end. Adjust the 4 if you have a different requirement.
You could try the sed command below:
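If the number of fields to drop varies, the field index can be computed. A small sketch; the helper name drop_leading_fields is made up for illustration.

```shell
#!/bin/sh
# drop_leading_fields N: hypothetical helper that removes the first N
# "|"-delimited fields (and their delimiters) from each line of stdin.
drop_leading_fields() {
    cut -d'|' -f"$(( $1 + 1 ))"-
}

printf '%s\n' \
    'abcd|123|kds|Name|Place|Phone' \
    'ldkdsd|323|jkds|Name1|Place1|Phone1' |
    drop_leading_fields 3
```

Note that cut prints lines containing no delimiter unchanged unless the -s option is given.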
$ sed -r 's/^(\s*)[^|]*\|[^|]*\|[^|]*\|/\1/g' file
Name|Place|Phone
Name1|Place1|Phone1
^(\s*) captures all the spaces which are at the start.
[^|]*\|[^|]*\|[^|]*\| Matches up to the third |. So abcd|123|kds| will be matched.
All the matched characters are replaced by the chars which are present inside the first captured group.
This might work for you (GNU sed):
sed 's/^\([^|]*|\)\{3\}//' file
or more readably:
sed -r 's/^([^|]*\|){3}//' file
sed 's/\(\([^|]*|\)\{3\}\)//' YourFile
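This is a POSIX version: in a basic regular expression an unescaped | is a literal character, so it needs no escaping. GNU sed in its default (basic) mode treats | the same way, so --posix is not required; | only means "OR" with -r/-E, or when written as \| in basic mode.
Explanation
Replace the first 3 occurrences (\{3\}) of [ any character but | followed by | (\([^|]*|\)) ] with nothing (// is an empty replacement).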
You can print the last 3 fields:
awk '{print $(NF-2),$(NF-1),$NF}' FS=\| OFS=\| file
Name|Place|Phone
Name1|Place1|Phone1
