PWM with gapped alignments in Biopython

I'm trying to generate a position weight matrix (PWM) in Biopython from Clustalw multiple sequence alignments. I get a "Wrong Alphabet" error every time I do it with gapped alignments. From reading the documentation, I think I need to use the Gapped Alphabet to deal with the '-' character in gapped alignments, but when I do this, it still doesn't resolve the error. Does anyone see the problem with this code, or have a better way to generate a PWM from gapped Clustal alignments?
from Bio import AlignIO
from Bio import Motif
from Bio.Alphabet import Gapped

alignment = AlignIO.read("filename.clustalw", "clustal", alphabet=Gapped)
m = Motif.Motif()
for a in alignment:
    m.add_instance(a.seq)
m.pwm()

So you want to use Clustal to make these gapped alignments? I use Perl rather than Python, but the logic is basically the same: I make a system call to the clustal executable instead of going through BioPerl/Biopython. I believe the clustalw2 executable handles gapped alignments without the need to specify an alphabet. Not 100 percent sure, but this is a script I use that works for me.

Create a directory with all of your alignment files in it (I use .fasta, but you can change the flags on the system call to accept other formats). This is my Perl script; you must modify the executable path in the last line to match clustal's location on your computer. Hope this helps a bit.

As a side note, this is good for making many alignments very quickly, which is what I use it for. If you are only looking to align a few files, you might want to skip the whole creating-a-directory step and modify the code to accept a file path rather than a directory path (see the single-file sketch after the script).
#!/usr/bin/perl
use strict;
use warnings;

print "Please type the directory of protein fasta files to align (end the directory path with a / or this will fail!): ";
my $directory = <STDIN>;
chomp $directory;

opendir (DIR, $directory) or die $!;
my @files = readdir DIR;
closedir DIR;

my $add = "_align.fasta";
foreach my $file (@files) {
    next if $file =~ /^\./;    # skip . and .. (and other dotfiles) returned by readdir
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile = "$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}
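For the single-file case mentioned above, a minimal sketch (same clustalw2 call; adjust the executable path to your own install):

my $infile = shift @ARGV or die "Usage: $0 file.fasta\n";
(my $outfile = $infile) =~ s/\.[^.]+$/_align.fasta/;
system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";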
Cheers,
Wes

Related

random file name in perl generated with unusual characters

Using the Perl code below, I try to output some names into a randomly generated file. But the files are created with weird characters, like this:
"snp-list-boo.dwjEUq5Wu^J.txt"
And, obviously, when my code looks for these files it says no such file. Also, when I try to open the files using "vi", they open like this:
vi 'temporary-files/snp-list-boo.dwjEUq5Wu
.txt'
i.e. with a "new line" in the file name. Someone please help me understand and solve this weird issue. Thanks much!
code:
my $tfile = `mktemp boo.XXXXXXXXX`;
my $fh = "";
foreach my $keys (keys %POS_HASH){
    open ($fh, '>>', "temporary-files/snp-list-$tfile.txt");
    print $fh "$keys $POS_HASH{$keys}\n";
    close $fh;
}
mktemp ends its output with a line feed character, which you need to chop() or chomp() off first.
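For example, picking up the questioner's own $tfile:

my $tfile = `mktemp boo.XXXXXXXXX`;
chomp $tfile;    # strip the trailing newline before interpolating into a filename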
Instead of using the external mktemp program, why not go with File::Temp?
Using external programs unnecessarily is a bad idea for a few reasons.
The external program that you use might not be available on all of the systems where your code runs. You are therefore making your program less portable.
Spawning a new sub-shell to run an external program is a lot slower than just doing the work in your current environment.
The values you get back from the external program are likely to have a newline character attached. And you might forget to remove it.
It's the last one that is burning you here. But the others still apply as well.
Perl's standard library has, for many, many years, included the File::Temp module, which creates temporary files for you without the need to use an external program.
use File::Temp qw/ tempfile /;

# It even opens the file and gives you the filehandle.
my ($fh, $filename) = tempfile();
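If you want the file in a particular directory with a recognizable name, tempfile() also accepts a template and a DIR option. A sketch along the lines of the original code (%POS_HASH and the temporary-files directory are from the question):

use File::Temp qw/ tempfile /;

my ($fh, $filename) = tempfile("snp-list-boo.XXXXXXXXX",
                               DIR    => "temporary-files",
                               SUFFIX => ".txt");
print $fh "$_ $POS_HASH{$_}\n" for keys %POS_HASH;
close $fh;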

Why is running opendir, readdir, stat so slow compared to the Windows dir command?

I have a Perl script that is using opendir to read the contents of a directory:
opendir ( DIR, $path ) or next;
while (my $file = readdir DIR) {
Then I'm doing:
-s $file to get the size of each file
(stat($file))[9] to get the modified time of each file
I'm running this from a Windows machine and accessing a Samba share on Ubuntu 14.04.
This is all working fine, but the process seems to run very slowly compared to when I run a dir listing on the same folder.
Does anyone know why using opendir takes so much longer than a dir listing and if there's any way I can change my script to speed it up?
According to perlport:
On Win32 stat() needs to open the file to determine the link count and update attributes that may have been changed through hard links. Setting ${^WIN32_SLOPPY_STAT} to a true value speeds up stat() by not performing this operation.
Since the files you're accessing are on a Samba share, opening them is probably fairly time consuming. Also, -s makes a stat system call behind the scenes, so calling -s followed by stat is wasteful.
The following should be faster:
use feature 'say';

local ${^WIN32_SLOPPY_STAT} = 1;

opendir my $dh, $path or die "Failed to opendir '$path': $!";

while (my $file = readdir $dh) {
    my ($size, $mtime) = (stat "$path/$file")[7, 9];    # stat needs the full path, not just the name
    say join "\t", $file, $size, $mtime;
}
dir will be much faster, as it is binary code that I suspect is heavily optimized, so it can retrieve and format the information quickly.
In your script it seems you are making several calls which have to be interpreted, one for the time and another for the size. Even if the lower-level calls in Perl are binary code, getting the information probably has to go through several layers. You could reduce the number of calls, per @mob's suggestion, by saving the returned values of stat and accessing the parts you need. For example:
my @items = stat($file);
my $size = $items[7];
my $modified = $items[9];
which would save one of the calls and possibly speed up the script.
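If readability matters more than shaving a call, the core File::stat module exposes the same fields by name (a small sketch, using the same $file as above):

use File::stat;

my $st = stat($file) or die "Could not stat $file: $!";
my $size = $st->size;         # same as field 7
my $modified = $st->mtime;    # same as field 9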
If you want all of the files, you might consider making a system call to run a directory command and redirect the output to a file, after which you can parse the file for the times and sizes. This may be a bit faster depending on the number of files. (/4 gives a 4-digit year, /t:w shows when the file was last written/modified, and /-c gets rid of the commas in the size.)
system("dir /4 /t:w /-c $path > tempList.txt");
Then open and parse the redirected file for the information you desire.
open my $in, '<', "tempList.txt" or die "Unable to open file tempList.txt: $!";
my @lines = <$in>;
close($in);
chomp(@lines);

foreach ( @lines )
{
    next if ( ! m/^\d{4}\/\d{2}\/\d{2}\s+/ );    # not a line describing a file
    my @parts = split('\s+');
    # Get the parts you need (time and size), where you may have to do some
    # other work to get them into the desired format
    # .....
}
You could also fold the matching and the extraction into a single regex, pulling out the items as you test whether the line should be processed. That might save some time and effort as well; for example:
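A sketch of that idea, assuming lines shaped like the flags above produce (e.g. "2015/06/01  10:23 AM  1234 somefile.txt"; adjust the pattern to your locale's date and time format):

foreach ( @lines )
{
    # match and capture date, time, size and filename in one pass
    if ( m{^(\d{4}/\d{2}/\d{2})\s+(\d{2}:\d{2}(?:\s[AP]M)?)\s+(\d+)\s+(.+)$} )
    {
        my ($date, $time, $size, $name) = ($1, $2, $3, $4);
        print "$name\t$size\t$date $time\n";
    }
}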

Load all data files in a directory in Octave

I need to load all the data files in a directory into Octave (no matter what their filenames are), so that the data from separate files are loaded into separate matrices. How can I do that?
I've tried to use dir and glob and then a for loop, but I don't know how to get matrices from cells.
I'm not 100% sure of your question. When you mention getting matrices from cells, I'm guessing your problem is extracting the filenames from the output of readdir and glob. If that's so, you can get the names with filelist{1} (if you use () to index a cell array you get another cell array).
filelist = readdir (pwd)
for ii = 1:numel(filelist)
    ## skip special files . and ..
    if (regexp (filelist{ii}, "^\\.\\.?$"))
        continue;
    endif
    ## load your file
    load filelist{ii}
    ## do your maths
endfor
You can use a struct on the load line if the filenames make good field names: data.(filelist{ii}) = load (filelist{ii}).
The answer by carandraug is great; I only want to add that in some Octave versions, the load line may need to be written as:
load (filelist{ii})

Best way to search for a string in a file on a network drive

Here is my problem: we have a file server (Windows 2003) on which people keep putting forms that contain PII. Policy now says that the last four digits of a person's SSN are no longer allowed in any form on our file servers. I'm trying to figure out a script to scan documents for a string such as "SSN" or "Last Four", but all I can find are instructions/examples for searching text files on a local machine. I have seen a lot of threads similar to this, but they are primarily about searching a txt file in a local folder. I've seen PowerShell scripts that do this, but (don't ask why) PowerShell scripting is disabled on our servers.
Is this possible? I've been reading heavily through multiple Perl books hoping for a clue or something to point me in the right direction, and have had zero luck.
Assuming you get access to the files eventually, here's how you can go about searching a directory of files, looking for a string match.
use strict;
use warnings;
use File::Find;

our $CHECK_FILE_EXTENSION = qr/\.txt$/;

File::Find::find({wanted => \&find_ssn, no_chdir => 1}, $_) for @ARGV;

exit;

sub find_ssn
{
    ## File::Find sets $File::Find::name to the full path of the file, which is
    ## the correct path for an 'open' call when 'no_chdir' is used
    return unless $File::Find::name =~ $CHECK_FILE_EXTENSION;
    open my $fh, '<', $File::Find::name or die "Can't read file, $File::Find::name, $!\n";
    while (<$fh>)
    {
        if (/SSN/)
        {
            ## file has 'SSN' in it, do your work here
        }
    }
    close $fh;
}
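A usage note: the script takes its start directories from @ARGV, so you would invoke it with one or more paths, UNC share paths included (hypothetical server and share names here):

perl find_ssn.pl //fileserver/forms //fileserver/archive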
Aside from I/O speed, there's no real difference between accessing a file remotely and locally. It's just a file descriptor.
C:\>perl -MFile::Slurp -E "my $dir = q|//SERVER/Share/Test|; for my $file (read_dir($dir)) { say qq|$file: |, (read_file(qq|$dir/$file|) =~ /foo/) ? q|match| : q|not match| }"
bar.txt: not match
foo.txt: match
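The same one-liner, unpacked into a script for readability (same File::Slurp calls; the UNC share path is the hypothetical example from above):

use File::Slurp qw/ read_dir read_file /;
use feature 'say';

my $dir = '//SERVER/Share/Test';    # a UNC path behaves like any other path
for my $file (read_dir($dir)) {
    say "$file: ", (read_file("$dir/$file") =~ /foo/) ? 'match' : 'not match';
}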

Change file names in Windows folders

Hi, I am trying to change the file names in some of my folders on a Windows machine.
I have a bunch of files whose names start with a capital letter, for example "Hello.html", but I want to change that to "hello.html". Since there are thousands of files, I cannot just go and change them manually. I am looking for a script and just need some help getting started, and with what I should start.
I also have access to a Linux machine, so I can copy the files over there and run any scripts. I would really appreciate it if someone could guide me to get started in either the Linux or Windows environment.
On some Linux systems you can use the rename command, which accepts regular expressions. Try the following:
rename 's/^([A-Z])/\l$1/' *
This should replace any uppercase character at the beginning of the name with a lowercase one.
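If your rename is the Perl-based File::Rename wrapper (common on Debian/Ubuntu), you can preview the changes first with -n, which prints what would be renamed without touching anything:

rename -n 's/^([A-Z])/\l$1/' *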
Otherwise, if you're not running a Linux system that provides such a command, you can write your own little Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;

my @files = `ls`;
foreach (@files) {
    chomp($_);
    if ($_ =~ m/^[A-Z]/) {
        my $newname = $_;
        $newname =~ s/^([A-Z])/\l$1/;
        move($_, $newname);
    }
}
exit 0;
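One caveat: move() will quietly overwrite an existing hello.html if both Hello.html and hello.html are present (possible once the files land on a case-sensitive Linux filesystem). A minimal guard, assuming you'd rather skip such collisions, is to replace the move($_, $newname) line with:

if (-e $newname) {
    warn "Skipping $_: $newname already exists\n";
} else {
    move($_, $newname) or warn "Could not move $_ to $newname: $!\n";
}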
A very easy-to-use option is ReNamer.
Once installed, simply add the files to be renamed, then add a case rule to change names to lowercase, or a regex rule for advanced cases.
