How to remove small contigs from fasta files? - shell

## removesmalls.pl
#!/usr/bin/perl
use strict;
use warnings;
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
local $/=">";
while(<>) {
chomp;
next unless /\w/;
s/>$//gs;
my @chunk = split /\n/;
my $header = shift @chunk;
my $seqlen = length join "", @chunk;
print ">$_" if($seqlen >= $minlen);
}
local $/="\n";
}
Execute the script as follows:
perl removesmalls.pl 1000 contigs.fasta > contigs-1000.fasta
The above script works for me, but there is a problem:
I have 109 different fasta files with different file names.
I can run the script on an individual file, but I want to run it once over all files, with a separate result file for each.
The file names are like SRR8224532.fasta, SRR8224533.fasta, SRR8224534.fasta, and so on.
I want the result files after removing the small contigs (i.e., for me, less than 1000) to be named something like SRR8224532-out.fasta,
SRR8224533-out.fasta, and so on.
Any help or suggestion would be helpful.
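One way to handle all 109 files at once, assuming a bash-like shell and that removesmalls.pl sits in the working directory, is a loop that builds each output name with parameter expansion:

```shell
# Run removesmalls.pl on every SRR*.fasta file in the current directory,
# writing each result to <name>-out.fasta.
for f in SRR*.fasta; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  out="${f%.fasta}-out.fasta"      # SRR8224532.fasta -> SRR8224532-out.fasta
  perl removesmalls.pl 1000 "$f" > "$out"
done
```

`${f%.fasta}` strips the `.fasta` suffix, so each output file keeps its original base name with `-out` appended.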

Related

How to Pass more than one file in perl MY function [Perl]

I am new to Perl and was wondering if you guys can help me with passing more than one file in the below code:
my @files=<data/j*.*.txt>;
if (@ARGV) {
my $test=$ARGV[0];
$test=lc($test);
print "Using $test instead\n";
@files=</data/$test*.*.txt>;
print "Found @files instead\n";
}
my $outfile='/data/w_c.txt';
my $lotfile='/data/completed.txt';
if (-e $outfile) {
unlink $outfile;
}
In the above code, my @files=<data/j*.*.txt>; currently picks up all the files starting with j*.*, but I would like to pass only the files below:
j*.1.txt
c*.3.1.txt
a*.a.b.txt
etc..
How could I pass the list of files in the program itself? I am trying to read all those files and extract information from them.
Thank you in advance.
You can use something like this:
<data/j*.1.txt data/c*.3.1.txt data/a*.a.b.txt>
There comes a point where it might be best to use <data/*.txt> and use a regex to filter out all but those you want.
Rather than using globs this way I'd be tempted to switch to opendir and readdir and to use an array of patterns in a regex with alternation to select my files. That way you're not using two different text wildcard syntaxes (for glob and for regex) in the same short snippet of code, which I've seen confuse programmers new to Perl before.
# Set your data directory.
my $dir = '/data';
# Take the whole array of arguments on the command line as patterns to
# match in the regex, or default to a short list of patterns if there
# are none.
# (Consider using an options library later rather than messing
# with @ARGV directly if the program becomes more complex.)
my @filespecs = ( scalar @ARGV ? @ARGV : qw( j.*?\.1\.txt c.*?\.3\.1\.txt ) );
# Join the multiple patterns with the regex alternation character.
# This makes them multiple matching options in a single regex.
my $re = join '|', @filespecs;
# Open the directory for reading, or terminate with an error.
opendir my $d, $dir or die "Cannot open directory $dir : $!\n";
# Select into the @files array things read from the directory
# entry that are regular files (-f, tested relative to $dir),
# do not start with '.', and which match the regex.
my @files = grep { (-f "$dir/$_") && (!/^\./) && (/$re/) } readdir $d;
# Close the directory handle now that we're done using it.
closedir $d;
Without the overly verbose comments, that boils down to just this.
my $dir = '/data';
my @filespecs = ( scalar @ARGV ? @ARGV : qw( j.*?\.1\.txt c.*?\.3\.1\.txt ) );
my $re = join '|', @filespecs;
opendir my $d, $dir or die "Cannot open directory $dir : $!\n";
my @files = grep { (-f "$dir/$_") && (!/^\./) && (/$re/) } readdir $d;
closedir $d;
I elided the last few lines of your original code because it doesn't seem directly related to your question.
Some sources for you to read that may help make sense of this solution:
perldoc perlop for the Conditional Operator
https://perldoc.perl.org/perlop#Conditional-Operator , and for qw()
https://perldoc.perl.org/perlop#qw/STRING/
perldoc perlre to learn
about Perl regexes, especially in this case alternation
https://perldoc.perl.org/perlre#Metacharacters
perldoc perlfunc for the -f file test https://perldoc.perl.org/perlfunc#-X-FILEHANDLE , opendir https://perldoc.perl.org/perlfunc#opendir-DIRHANDLE,EXPR , readdir https://perldoc.perl.org/perlfunc#readdir-DIRHANDLE , closedir https://perldoc.perl.org/perlfunc#closedir-DIRHANDLE , and grep https://perldoc.perl.org/perlfunc#grep-BLOCK-LIST

Executing Perl script from windows-command line with 2 entry

This is my Perl script:
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my @matches = grep { /^$s$/ } @{$signalsRef};
print "line $i : not found - $s\n" if scalar @matches == 0;
print "line $i : multiple $s\n" if scalar @matches > 1;
}
$i = $i + 1; # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my @virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \@virtualpath);
The script works well. The aim of it is to compare the files "SignalModel.xml" and "CAN_ESP_002_mapping.pm", putting the lines that didn't match into a .TXT file. Here is what the .TXT file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But for this script, I put the two files that need to be compared inside the code. Instead of doing that, I would like to run the script from the Windows cmd line with something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are in a .zip file, so if the script can go inside the archive and take the 2 files that I need for comparison, that would be perfect.
I really don't know what to put inside my script to make that work from the Windows cmd. Thanks for your help!
Program (or script) parameters are stored in the @ARGV array. shift and pop without any parameter will work on @ARGV when used outside of a sub; in a sub they operate on @_.
See Archive::Zip for zip file handling.

file organisation in windows using perl

I am working on a Windows machine and I have a directory filled with ~200k files which I need to organise. This is a job I will need to do regularly with different filename sets but similar patterns, so Perl seemed a good tool to use.
Each filename is made up of {a string A}{2 or 3 digit number B}{single letter "r" or "x"}{3 digit number}.extension
I want to create a folder for each string A
Within each folder I want a sub-folder for each B
I then want to move each file into its relevant sub-folder
So it will end up looking something like
/CustomerA/1
/CustomerA/2
/CustomerA/3
/CustomerB/1
/CustomerB/2
/CustomerB/3
etc with the files in each sub-folder
so CustomerA888x123.xml is moved into /CustomerA/888/
I have the list of files in an array but I am struggling with splitting the file name out to its constituent parts and using the parts effectively.
Thanks for the answer. I ended up with this:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use File::Path qw(make_path);
opendir my $dir, ".";
my #files = readdir($dir);
closedir $dir;
foreach my $file (@files) {
my ($cust, $num) = $file =~ m/(\D+)(\d+)/;
my $dirname = "$cust/$num";
my @dirs_made = make_path($dirname, { verbose => 1 });
move($file, $dirname) or warn "can't move $file to $dirname: $!";
}
Given your description of file names, this regex should parse what you need
my ($cust, $num) = $filename =~ m/(\D+)(\d+)/;
Use a more precise pattern if you wish or need to be more specific about what precedes the number, for example [a-zA-Z] for letters only.
With that on hand, you can create directories using the core module File::Path, for example
use File::Path qw(make_path);
my $dirname = "$cust/$num";
my @dirs_made = make_path($dirname, { verbose => 1 });
This creates the path as needed, returning the names of created directories. It also prints the names with the verbose option. If the directory exists it quietly skips it. If there are problems it dies, so you may want to wrap it in eval
eval { make_path($dirname) };
if ($@) {
warn "Error with make_path($dirname): $@";
}
Also note the File::Path::Tiny module as an alternative, thanks to Sinan Ünür for bringing it up. Other than being far lighter, it also has the more common error-handling policy whereby false is returned on failure, so you don't need an eval but only the usual check
use File::Path::Tiny;
File::Path::Tiny::mk($path) or warn "Can't mk($path): $!";
The module behaves similarly to mkdir in many ways, see the linked documentation.
Move the files using the move function from the core module File::Copy, for example
use File::Copy qw(move);
move($file, $dirname) or warn "Can't move $file to $dirname: $!";
All this can be in a loop over the array with the file names.
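As a quick sanity check of the naming scheme outside Perl, the same {letters}{digits} split can be sketched in bash with its regex match operator (the file name below is the hypothetical example from the question):

```shell
# Split "CustomerA888x123.xml" into customer and number parts,
# equivalent to the Perl capture (\D+)(\d+) anchored at the start.
f="CustomerA888x123.xml"
if [[ $f =~ ^([A-Za-z]+)([0-9]+) ]]; then
  cust="${BASH_REMATCH[1]}"   # CustomerA
  num="${BASH_REMATCH[2]}"    # 888
  echo "$cust/$num"           # prints CustomerA/888
fi
```

This confirms that CustomerA888x123.xml maps to the target sub-folder /CustomerA/888/ as described.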

Perl - Using Variables from an Input File in the URL when a Variable has a Space (two words)

What am I doing? The script loads a string from a .txt (locations.txt), and separates it into 6 variables. Each variable is separated by a comma. Then I go to a website, whose address depends on these 6 values.
What is the problem? If a variable in locations.txt contains a space (for example a two-word country name), the script does not build the correct URL.
The input file is:
locations.txt = Heinz,Weber,Sierra Leone,1915,M,White
Because Sierra Leone has a space, the url is:
https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3AHeinz%20%2Bsurname%3AWeber%20%2Bbirth_place%3A%22Sierra%20Leone%22%20%2Bbirth_year%3A1914-1918~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
But that does not get processed correctly in the code below.
I'm using the packages:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TableExtract;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
This is the beginning of the code :
open(my $l, 'locations26.txt') or die "Can't open locations: $!";
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
while (my $line = <$l>) {
chomp $line;
my %args;
@args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;
$args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);
my $mech = WWW::Mechanize::Firefox->new(create => 1, activate => 1);
$mech->get("https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219");
# REST OF THE SCRIPT HERE. MANY LINES.
}
As another example, the following would work:
locations.txt = Benjamin,Schuvlein,Germany,1913,M,White
I have not used Mechanize, so not sure whether you need to encode the URL. Try encoding space to %20 or + before running $mech->get
$url =~ s/ /+/g;
Or
$url =~ s/ /%20/g;
whichever works :)
Edit:
my $url = "https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219";
$url =~ s/ /+/g;
$mech->get($url);
Try that.
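The substitution itself is easy to sanity-check outside Perl; here is the same space-to-%20 replacement sketched in bash (with a shortened, made-up URL):

```shell
# Replace every space in a URL with %20 using bash global pattern substitution.
url="https://familysearch.org/search?birth_place=Sierra Leone"
encoded="${url// /%20}"
echo "$encoded"   # prints https://familysearch.org/search?birth_place=Sierra%20Leone
```

The Perl s/ /%20/g does the same thing to $url before it is handed to $mech->get.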
If you have the error
Global symbol "$url" requires explicit package name.
this means that you forgot to declare $url with :
my $url;
Your use part seems freaky; I'm pretty sure that you don't need all of those modules at the same time. If you use WWW::Mechanize, there is no need for LWP::UserAgent and CGI, I guess...

Searching Multiple Strings in Huge log files

Powershell question
Currently I have 5-10 log files, each about 20-25GB, and need to search through each of them to check whether any of 900 different search parameters match. I have written a basic PowerShell script that searches through the whole log file for 1 search parameter. If it matches, it dumps the results out into a separate text file. The problem is that it is pretty slow. I was wondering if there is a way to speed this up, for example by searching for all 900 parameters at once and looking through the log only once. Any help would be good, even if it's just improving the script.
basic overview :
1 csv file with all the 900 items listed under an "item" column
1 log file (.txt)
1 result file (.txt)
1 ps1 file
Here is the code I have for PowerShell in a PS1 file:
$search = "filepath to csv file"
$log = "filepath to log file"
$result = "file path to result text file"
$list = import-csv $search
foreach ($address in $list) {
Get-Content $log | Select-String $address.item | add-content $result
# below is just for displaying a rudimentary counter of how far through searching it is
$i = $i + 1
echo $i
}
900 search terms is quite a large group. Can you reduce its size by using regular expressions? A trivial solution is based on reading the file row-by-row and looking for matches. Set up a collection that contains regexps or literal strings for search terms. Like so,
$terms = @("Keyword[12]", "KeywordA", "KeyphraseOne") # Array of regexps
$src = "path-to-some-huge-file" # Path to the file
$reader = new-object IO.StreamReader($src) # Stream reader to file
while(($line = $reader.ReadLine()) -ne $null){ # Read one row at a time
foreach($t in $terms) { # For each search term...
if($line -match $t) { # check if the line read is a match...
$("Hit: {0} ({1})" -f $line, $t) # and print match
}
}
}
$reader.Close() # Close the reader
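For comparison, outside PowerShell the same one-pass, many-patterns idea is what grep's -f option implements (GNU grep assumed); a minimal sketch with made-up terms and log lines:

```shell
# Build a patterns file and a small sample log, then scan the log once
# for all patterns (-F treats them as fixed strings, not regexes).
printf '%s\n' 'ERROR 42' 'timeout' > terms.txt
printf '%s\n' 'all ok' 'ERROR 42 seen' 'slow timeout here' > sample.log
grep -F -f terms.txt sample.log   # prints the two matching lines
```

With 900 literal terms, scanning each log file once this way is far cheaper than 900 separate passes, which is the same win the StreamReader loop above aims for.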
Surely this is going to be incredibly painful with any parser, just based on the file sizes you have there, but if your log files are in a standard format (for example IIS log files) then you could consider using a log-parsing app such as Log Parser Studio instead of PowerShell.