Open Reading Frame program not printing Amino Acid Sequences - macos

I am working on a program that reads a gene sequence, finds the open reading frames (ORFs), and then gives the protein sequence of each ORF. The ORF-finding part already works, but no amino acids are printed. I am using Perl on my Mac.
I would like the code to print the string of amino acids produced from each open reading frame.
Here is my code:
#!/usr/bin/perl
#ORF_Find.txt -> finds long orfs in a DNA sequence
open(CHROM, "chr03.txt"); #Open file chr03.txt containing yeastchrom. 3
$DNA = ""; #start with empty DNA sequence
$header = <CHROM>; #get header of sequence
#Read line from file, join to end of $DNA, repeat until end of file
while ($current_line = <CHROM>)
{
chomp($current_line); #remove newline from end of current_line
$DNA= $DNA . $current_line;
}
#length of DNA sequence
$DNA_length = length($DNA);
#flag for ORF finder
$inORF=0;
#number of ORFs found
$numORFs = 0;
#minimum length
$minimum_codons =100;
#search each reading frame
for ($frame =0; $frame<3; $frame++)
{
print "\nFinding ORFs in frame: +" . ($frame + 1) . "\n";
#search for sequence match and print position of match if found
for ($i =frame; $i<=($DNA_length-3);$i += 3)
{
#get current codon from sequence
$codon= substr ($DNA, $i, 3);
#if not in orf search for ATG, else search for stop codon
if ($inORF == 0)
{
#if current codon is ATG, start ORF
if ($codon eq "ATG")
{
$inORF = 1;
$ORF_length = 1;
$ORF_start = $i;
}
}
elsif($inORF ==1)
{
#if current codon is a stop codon, end ORF
if ($codon eq "TGA" || $codon eq "TAG" || $codon eq "TAA")
{
#if ORF has at least min number of codons,print location
if ($ORF_length >= $minimum_codons)
{
print "FOUND ORF AT POSITION $ORF_start,";
print "length = $ORF_length\n";
$numORFs++;
}
#reset ORF variables
$inORF = 0;
$ORF_length = 0;
}
else
{
#increase length of ORF by one codon
$ORF_length++;
}
}
}
}
#change T to U
$DNA =~ s/T/U/g;
#search each ORF
for ($i=$ORF_start; $i<=($ORF_length-3); $i+=3)
{
#get codon from each ORF
$aa_codon= substr($DNA, $i, 3);
#find amino acids
foreach ($aa_codon eq "ATG")
{
print ("M") #METHIONINE
}
foreach ($aa_codon =~/UU[UC]/)
{
print ("F") #PHENYLALANINE
}
foreach ($aa_codon =~/UU[AG]/ || $aa_codon=~/CU[UCAG]/)
{
print ("L"); #LEUCINE
}
foreach ($aa_codon =~/AU[UAC]/)
{
print ("I"); #ISOLEUCINE
}
foreach ($aa_codon =~/GU[UACG]/)
{
print ("V"); #VALINE
}
foreach ($aa_codon =~/UC[UCAG]/ || $aa_codon=~/AG[UC]/)
{
print ("S"); #SERINE
}
foreach ($aa_codon =~/CC[UCAG]/)
{
print ("P"); #PROLINE
}
foreach ($aa_codon =~/AC[UCAG]/)
{
print ("T"); #THREONINE
}
foreach ($aa_codon =~/GC[UCAG]/)
{
print ("A"); #ALANINE
}
foreach ($aa_codon =~/UA[UC]/)
{
print ("Y"); #TYROSINE
}
foreach ($aa_codon =~/CA[UC]/)
{
print ("H"); #HISTIDINE
}
foreach ($aa_codon =~/CA[AG]/)
{
print ("G"); #GLUTAMINE
}
foreach ($aa_codon =~/AA[UC]/)
{
print ("N"); #ASPARAGINE
}
foreach ($aa_codon =~/AA[AG]/)
{
print ("K"); #LYSINE
}
foreach ($aa_codon =~/GA[UC]/)
{
print ("D"); #ASPARTIC ACID
}
foreach ($aa_codon =~/GA[AG]/)
{
print ("E"); #GLUTAMIC ACID
}
foreach ($aa_codon =~/UG[UC]/)
{
print ("C"); #CYSTINE
}
foreach ($aa_codon eq "UGG")
{
print ("W"); #TRYPTOPHAN
}
foreach ($aa_codon =~/AG[AG]/ || $aa_codon =~/CG[UCAG]/)
{
print ("R"); #ARGININE
}
foreach ($aa_codon =~/GG[UCAG]/)
{
print ("G"); #GLYCINE
}
foreach ($aa_codon =~/UA[AG]/|| $aa_codon eq "UGA")
{
print ("*") #STOP
}
}
#if no ORFS found, print message
if ($numORFs ==0)
{
print ("NO ORFS FOUND\n");
}
else
{
print ("\n$num_ORFs ORFS WERE FOUND\n");
}

First, this question would probably be more appropriate for a forum such as SEQanswers or Biostars. That aside, writing your own 6-frame translation script is a complex task, especially if you want to account for IUPAC ambiguous nucleotides. Plenty of scripts and tools already do this, so the easiest suggestion I can make is to use one of them. Try mine, for example:
https://github.com/hepcat72/sixFrameTranslation/archive/master.zip
My script wasn't public until just now; I have opened it up so that you can use it. Run it to see its usage message.
Other than that, if you want to get your own version running properly, the first thing to do is change your shebang line to:
#!/usr/bin/perl -w
Note the -w. Then, add this line to the top of your script:
use strict;
It will help you debug issues such as the missing dollar sign in one of your for loops:
for ($i =frame; $i<=($DNA_length-3);$i += 3)
It should be:
for ($i =$frame; $i<=($DNA_length-3);$i += 3)
And BTW, it doesn't matter that you're running Perl on your Mac. It's just Perl. "MacPerl" was a project to provide a Perl environment back in the pre-OS-X days.
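For comparison, here is a minimal Python sketch of the find-then-translate idea. It is neither the linked script nor a line-by-line fix of the Perl above, only an illustration under stated assumptions: the standard genetic code, forward frames only, no ambiguity codes, and hypothetical names (`CODON_TABLE`, `find_orfs`). It looks codons up as DNA, so no T-to-U substitution is needed, it resets the "inside an ORF" state at the start of each frame, and it translates each ORF at the moment its stop codon is found. Note that the one-letter code for glutamine is Q, not G.

```python
# Standard genetic code in the conventional TCAG ordering.
# Assumption: nuclear standard code, unambiguous A/C/G/T only.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, protein) for every ATG..stop ORF in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None  # reset per frame so an unfinished ORF cannot leak into the next frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOPS:
                n_codons = (i - start) // 3  # codons from ATG up to, not including, the stop
                if n_codons >= min_codons:
                    # Unknown codons (e.g. containing N) become "X".
                    protein = "".join(
                        CODON_TABLE.get(dna[j:j + 3], "X") for j in range(start, i, 3)
                    )
                    orfs.append((start, protein))
                start = None
    return orfs
```

Calling `find_orfs(dna, min_codons=1)` on a short test string is a quick way to check the frame handling before running it on a whole chromosome.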


Compress ranges of ranges of numbers in bash

I have a csv file named "ranges.csv", which contains:
start_range,stop_range
9702220000,9702220999
9702222000,9702222999
9702223000,9702223999
9750000000,9750000999
9750001000,9750001999
9750002000,9750002999
I am trying to combine consecutive ranges, i.e. those where one range's stop_range equals the next range's start_range - 1, and write the result to another csv file named "ranges2.csv". So the output will be:
9702220000,9702220999
9702222000,9702223999
9750000000,9750002999
Moreover, I need to know how many original ranges each compressed range contains (for example, the new range 9750000000,9750002999 was built from 3 ranges). This information will help me create a new csv file named "ranges3.csv", which should contain only the compressed range built from the most original ranges (the most comprehensive area):
9750000000,9750002999
I was thinking about something like this:
if (stop_range = start_range-1)
new_stop_range = start_range-1
But I am new to bash scripting.
I know how to write the results to another file, but the merging logic itself gives me headaches.
I think this does the trick:
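#!/bin/bash
awk '
BEGIN { FS = OFS = "," }
# line 1 is the header; line 2 starts the first merged range
NR == 2 {
    start = $1; stop = $2; i = 1
}
NR > 2 {
    if ($1 == (stop + 1)) {
        # consecutive: the current merged range grows by one
        i++
    } else {
        # gap: remember the finished range if it is the largest so far
        if (i > max) {
            maxr = start "," stop
            max = i
        }
        start = $1
        i = 1
    }
    stop = $2
}
END {
    if (i > max) {
        maxr = start "," stop
    }
    print maxr
}
' ranges.csv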
Assuming your ranges are sorted, then this code gives you the merged ranges only:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e; b=e="" }
($1==e+1){ e=$2; next }
{ b=$1; e=$2 }
END { print b,e }' file
Below you get the same but with the range count:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e,c; b=e=c="" }
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { print b,e,c }' file
If you want the largest one, you can sort on the third column. I don't want to make a rule to give the range with the most counts, as there might be multiple.
If you really only want all the ranges with the maximum merge:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){
a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
b=e=c=""
}
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
print a[m]
}' file
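The same merge-and-count logic can be sketched in Python, which may be easier to step through than the awk versions. This is only an illustration under the same assumption as above (the input is already sorted); `merge_ranges` and `widest` are hypothetical names, and reading and writing the csv files is left out.

```python
def merge_ranges(ranges):
    """Merge sorted (start, stop) pairs whenever start == previous stop + 1.

    Returns a list of (start, stop, count), where count is the number of
    original ranges folded into each merged range.
    """
    merged = []
    for start, stop in ranges:
        if merged and start == merged[-1][1] + 1:
            first, _, count = merged[-1]
            merged[-1] = (first, stop, count + 1)
        else:
            merged.append((start, stop, 1))
    return merged


def widest(merged):
    """Return every merged (start, stop) that has the maximum count (there may be ties)."""
    best = max(count for _, _, count in merged)
    return [(start, stop) for start, stop, count in merged if count == best]


# Sample rows from ranges.csv, header omitted.
SAMPLE = [
    (9702220000, 9702220999),
    (9702222000, 9702222999),
    (9702223000, 9702223999),
    (9750000000, 9750000999),
    (9750001000, 9750001999),
    (9750002000, 9750002999),
]
```

`merge_ranges(SAMPLE)` gives the rows for ranges2.csv plus a count, and `widest(...)` on that result gives the rows for ranges3.csv.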

How do I make a list of missing integers from a sequence using bash

I have a file let's say files_190911.csv whose contents are as follows.
EDR_MPU023_09_20190911080534.csv.gz
EDR_MPU023_10_20190911081301.csv.gz
EDR_MPU023_11_20190911083544.csv.gz
EDR_MPU023_14_20190911091405.csv.gz
EDR_MPU023_15_20190911105513.csv.gz
EDR_MPU023_16_20190911105911.csv.gz
EDR_MPU024_50_20190911235332.csv.gz
EDR_MPU024_51_20190911235400.csv.gz
EDR_MPU024_52_20190911235501.csv.gz
EDR_MPU024_54_20190911235805.csv.gz
EDR_MPU024_55_20190911235937.csv.gz
EDR_MPU025_24_20190911000050.csv.gz
EDR_MPU025_25_20190911000155.csv.gz
EDR_MPU025_26_20190911000302.csv.gz
EDR_MPU025_29_20190911000624.csv.gz
I want to make a list of the missing sequence numbers from those using a bash script.
Every MPUXXX has its own sequence, so there are multiple series of sequences in that file.
Each missing entry should use the datetime of the previous existing entry.
From the sample above, the result will be like this.
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
It would be simpler if there were only a single sequence.
So I can use something like this.
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
But I know this can't be used for multiple sequences.
EDITED (Thanks @Cyrus!)
AWK is your friend:
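#!/usr/bin/awk -f
BEGIN {
    # split on runs of non-digits: $2 = MPU id, $3 = serial, $4 = timestamp
    FS = "[^0-9]+"
    last_seq = 0
    next_serial = 0
}
{
    cur_seq = $2
    cur_serial = $3
    if (cur_seq != last_seq) {
        # first line of a new MPU series: nothing to compare against yet
        last_seq = cur_seq
    } else {
        # print every serial skipped since the previous line, with its timestamp
        for (i = next_serial; i < cur_serial; i++) {
            print "EDR_MPU" last_seq "_" i "_" ts ".csv.gz"
        }
    }
    # remember this line so the next gap uses the right timestamp
    ts = $4
    next_serial = cur_serial + 1
}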
And then you do:
$ < files_190911.csv awk -f script.awk
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
The assignment to FS splits each line on the given regex. The rest of the program detects holes in the sequences and prints them with the appropriate timestamp.
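For readers who prefer to see the per-series bookkeeping spelled out, here is a rough Python sketch of the same gap detection. It is an illustration, not a translation of the awk script: `missing_files` is a hypothetical name, the file name layout is assumed to be exactly `EDR_MPU<id>_<serial>_<timestamp>.csv.gz`, and missing serials are zero-padded to the width of the serial that follows the gap (an assumption, since the sample only has two-digit serials).

```python
import re

# Assumed layout: EDR_MPU<id>_<serial>_<timestamp>.csv.gz
PATTERN = re.compile(r"^EDR_MPU(\d+)_(\d+)_(\d+)\.csv\.gz$")


def missing_files(names):
    """Return the file names missing from each MPU series.

    Each missing name reuses the timestamp of the last existing file
    seen before the gap in the same series.
    """
    missing = []
    previous = {}  # MPU id -> (last serial seen, its timestamp)
    for name in names:
        match = PATTERN.match(name.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        mpu, serial_text, timestamp = match.groups()
        serial = int(serial_text)
        if mpu in previous:
            last_serial, last_timestamp = previous[mpu]
            width = len(serial_text)
            for gap in range(last_serial + 1, serial):
                missing.append(f"EDR_MPU{mpu}_{gap:0{width}d}_{last_timestamp}.csv.gz")
        previous[mpu] = (serial, timestamp)
    return missing
```

Because the state is kept in a dictionary keyed by MPU id, the series do not have to be grouped together in the input, only ordered within each series.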

How to take out certain elements from a pdb file

I am trying to take out certain columns from a pdb file. I already have taken out all lines that start out with ATOM in my code. For some reason my sub functions are not working and I do not know where or how to call them.
My code is:
open (FILE, $ARGV[0])
or die "Could not open file\n";
my @newlines;
while ( my $line = <FILE> ) {
if ($line =~ m/^ATOM.*/) {
push @newlines, $line;
}
}
my $atomcount = @newlines;
#print "@newlines\n";
#print "$atomcount\n";
##############################################################
#This function will take out the element from each line
#The element is from column 77 and contains one or two letters
sub atomfreq {
foreach my $record1(@newlines) {
my $element = substr($record1, 76, 2);
print "$element\n";
return;
}
}
################################################################
#This function will take out the residue name from each line
#The element is from column 18 and contains 3 letters
sub resfreq {
foreach my $record2(@newlines) {
my $residue = substr($record2, 17, 3);
print "$residue\n";
return;
}
}
As @Ossip already said in this answer, you simply need to call your functions:
sub atomfreq {
...
}
sub resfreq {
...
}
atomfreq();
resfreq();
But I'm not sure whether these functions do what you intended, because the comments imply that they should print every $residue and $element from the @newlines array. You've put a return statement inside the for loop, which returns from the whole function (and its loop) immediately, so only the first $residue or $element is printed. Because the functions aren't supposed to return anything, you can just drop that statement:
sub atomfreq {
foreach my $record1(@newlines) {
my $element = substr($record1, 76, 2);
print "$element\n";
}
}
sub resfreq {
foreach my $record2(@newlines) {
my $residue = substr($record2, 17, 3);
print "$residue\n";
}
}
atomfreq();
resfreq();
You can just call them right under your other code like this:
atomfreq();
resfreq();
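The function names atomfreq and resfreq suggest that frequencies are the eventual goal, so here is a small Python sketch of the same column slicing with a tally on top. Treat the tallying as an assumption about intent. `atom_and_residue_counts` is a hypothetical name, and the slices mirror the offsets used in the Perl above (columns 18-20 for the residue name, 77-78 for the element).

```python
from collections import Counter


def atom_and_residue_counts(lines):
    """Tally element symbols and residue names over the ATOM records in `lines`.

    Returns (elements, residues) as two Counters. Residue names are counted
    once per atom line, not once per residue.
    """
    elements = Counter()
    residues = Counter()
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        residues[line[17:20].strip()] += 1  # columns 18-20
        elements[line[76:78].strip()] += 1  # columns 77-78
    return elements, residues
```

Passing an open file object works as well as a list, since the function only iterates over lines.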

Algorithm in Perl to find a value in an array - Absolutely Interview Questions

I was asked to write a Perl program that looks for a value (from user input) in an array. If the value matches an element, that's fine. If it does not match, check whether it falls between two neighbouring elements (index[0] and index[1], ..., index[n-1] and index[n]), and if so report which of those two elements it is nearer to.
Let me explain.
Given array : 10 15 20 25 30;
Get the value from user : 14 (eg.)
Here 14 falls between two elements: 10 (array[0]) and 15 (array[1]).
The constraint is: do not use more than one for loop, and never use a while loop. A single for loop plus any number of if conditions is allowed.
Here is how I got the output:
use strict;
use warnings;
my @arr1 = qw(10 15 20 25 30);
my $in = <STDIN>;
chomp($in);
if(grep /$in/, @arr1)
{ } #print "S: $in\n"; }
else
{
for(my $i=0; $i<scalar(@arr1); $i++)
{
my $j = $i + 1;
if($in > $arr1[$i] && $in < $arr1[$j])
{
#print "SN: $arr1[$i]\t$arr1[$j]\n";
my ($inc, $dec) = "0";
my $chk1 = $arr1[$i] + 1;
AGAIN1:
if($in == $chk1)
{ }
else
{ $chk1++; $inc++; goto AGAIN1; }
my $chk2 = $arr1[$j] - 1;
AGAIN2:
if($in == $chk2){ }
else
{ $chk2--; $dec++; goto AGAIN2; }
if($inc > $dec)
{ print "Matched value nearest to $arr1[$j]\n"; }
elsif($inc < $dec)
{ print "Matched value nearest to $arr1[$i]\n"; }
}
}
}
However, my question is: is there a better algorithm for this? Any help would be appreciated.
Thanks in advance.
You seem determined to make this as complicated as possible :-)
Your specification isn't completely clear, but I think this does what you want:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my @array = qw[10 15 20 25 30];
chomp(my $in = <STDIN>);
if ($in < $array[0]) {
say "$in is less than first element in the array";
exit;
}
if ($in > $array[-1]) {
say "$in is greater than last element in the array";
exit;
}
for (0 .. $#array) {
if ($in == $array[$_]) {
say "$in is in the array";
exit;
}
if ($in < $array[$_]) {
if ($in - $array[$_ - 1] < $array[$_] - $in) {
say "$in is closest to $array[$_ - 1]";
} else {
say "$in is closest to $array[$_]";
}
exit;
}
}
say "Shouldn't get here!";
Here is a version using the helper functions any and reduce from the core module List::Util, and the built-in abs:
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce any/;
my @arr1 = qw(10 15 20 25 30);
chomp(my $in = <STDIN>);
if (any {$in == $_} @arr1) {
print "$in is in the array\n";
}
else {
my $i = reduce { abs($in - $arr1[$a]) > abs($in - $arr1[$b]) ? $b : $a} 0 .. $#arr1;
print "$in is closest to $arr1[$i]\n";
}
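If the array is known to be sorted, as in the example, a binary search finds the neighbours without any explicit loop. The sketch below is in Python and swaps in a different technique from either answer above (bisection instead of a linear scan). `nearest` is a hypothetical name, and on a tie it picks the lower neighbour, which is a choice the question does not specify.

```python
from bisect import bisect_left


def nearest(sorted_values, x):
    """Return (value, exact): the element of `sorted_values` closest to x.

    `exact` is True when x itself is in the list. Ties go to the lower
    neighbour. Assumes the list is non-empty and sorted ascending.
    """
    i = bisect_left(sorted_values, x)
    if i < len(sorted_values) and sorted_values[i] == x:
        return x, True
    if i == 0:
        return sorted_values[0], False  # x is below the first element
    if i == len(sorted_values):
        return sorted_values[-1], False  # x is above the last element
    lower, upper = sorted_values[i - 1], sorted_values[i]
    return (lower if x - lower <= upper - x else upper), False
```

For the example in the question, `nearest([10, 15, 20, 25, 30], 14)` reports 15 as the closest element.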

Unable to increment the last 2 digits of a variable declared in a file using a script

I have the file given below:
elix554bx.xayybol.42> vi setup.REVISION
# Revision information
setenv RSTATE R24C01
setenv CREVISION X3
exit
My requirement is to read RSTATE from the setup.REVISION file, increment the last 2 digits of RSTATE, and write the result back to the same file.
Can you please suggest how to do this?
If you're using vim, then you can use the sequence:
/RSTATE/
$<C-a>:x
The first line is followed by a return and searches for RSTATE. The second line jumps to the end of the line and uses Control-a (shown as <C-a> above, and in the vim documentation) to increment the number. Repeat as often as you want to increment the number. The :x is also followed by a return and saves the file.
The only tricky bit is that the leading 0 on the number makes vim think the number is in octal, not decimal. You can override that by using :set nrformats= followed by return to turn off octal and hex; the default value is nrformats=octal,hex.
You can learn an awful lot about vim from the book Practical Vim: Edit Text at the Speed of Thought by Drew Neil. This information comes from Tip 10 in chapter 2.
Here's an awk one-liner type solution:
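awk '{
    # only touch the RSTATE line; a regex literal avoids breaking the shell quoting
    if ( $0 ~ /RSTATE/ ) {
        match($0, /[0-9]+$/)
        # replace the trailing digits with value + 1, zero-padded to the same width
        sub( /[0-9]+$/,
             sprintf( "%0" RLENGTH "d", substr($0, RSTART, RLENGTH) + 1 ) )
    }
    print
}' setup.REVISION > tmp$$
mv tmp$$ setup.REVISION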
Returns:
setenv RSTATE R24C02
setenv CREVISION X3
exit
This will handle transitions from two to three to more digits appropriately.
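The same increment can be sketched in Python with a regular expression, which makes the "keep the width, add one" step explicit. This is an illustration only: `bump_rstate` is a hypothetical name, and the pattern assumes the line has the form `setenv RSTATE <value>` with the digits at the very end of the value.

```python
import re

# Assumed line shape: "setenv RSTATE <prefix><digits>", digits at end of line.
RSTATE_LINE = re.compile(r"^(setenv[ \t]+RSTATE[ \t]+\S*?)(\d+)[ \t]*$", re.MULTILINE)


def bump_rstate(text):
    """Return `text` with the trailing number on the RSTATE line incremented.

    The number keeps its zero-padded width and grows by a digit on overflow
    (e.g. 99 -> 100). Other lines are left untouched.
    """
    def replace(match):
        digits = match.group(2)
        return match.group(1) + str(int(digits) + 1).zfill(len(digits))

    return RSTATE_LINE.sub(replace, text)
```

To update the file in place, read it into a string, pass it through `bump_rstate`, and write the result back, ideally via a temporary file as the awk version does.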
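I wrote a class for you (C#). Note that it only reads the value; it does not write the file back:
using System;
using System.IO;
class Reader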
{
public string ReadRs(string fileWithPath)
{
string keyword = "RSTATE";
string rs = "";
if(File.Exists(fileWithPath))
{
StreamReader reader = File.OpenText(fileWithPath);
try
{
string line = null;
bool found = false;
// ReadLine() returns null at end of file, so stop there
while (!found && (line = reader.ReadLine()) != null)
{
found = line.Contains(keyword);
}
if (found)
{
int index = line.IndexOf(keyword);
// everything after the keyword, without surrounding whitespace
rs = line.Substring(index + keyword.Length).Trim();
}
}
catch (IOException)
{
//Error
}
finally
{
reader.Close();
}
}
return rs;
}
public int GetLastTwoDigits(string rsState)
{
int digits = -1;
try
{
int length = rsState.Length;
//Get the last two digits of the rsstate
digits = Int32.Parse(rsState.Substring(length - 2, 2));
}
catch (FormatException)
{
//Format Error
digits = -1;
}
return digits;
}
}
You can use it like this:
Reader reader = new Reader();
string rsstate = reader.ReadRs("C://test.txt");
int digits = reader.GetLastTwoDigits(rsstate);
