Downloading multiple FASTA files from NCBI via FTP

I'm trying to download all FASTA files associated with one organism from NCBI.
I tried wget -r -l3 -A "*.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/ to get every file ending in .fna.gz from the third level down, but it just rejects everything with the following output:
Removed “ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/latest_assembly_versions/.listing”.
Rejecting “GCF_000010625.1_ASM1062v1”.
Rejecting “GCF_000307995.1_ASM30799v2”.
Rejecting “GCF_000312165.1_ASM31216v1”.
Rejecting “GCF_000312185.1_ASM31218v1”.
Rejecting “GCF_000312205.1_ASM31220v1”.
Rejecting “GCF_000312225.1_ASM31222v1”.
Rejecting “GCF_000312245.1_ASM31224v1”.
Rejecting “GCF_000312265.1_ASM31226v1”.
Rejecting “GCF_000312285.1_ASM31228v1”.
Rejecting “GCF_000312725.1_ASM31272v1”.
Rejecting “GCF_000330925.1_MicAerT1.0”.
Rejecting “GCF_000332585.1_MicAerD1.0”.
Rejecting “GCF_000412595.1_spc777-v1”.
Rejecting “GCF_000599945.1_Mic70051.0”.
Rejecting “GCF_000787675.1_ASM78767v1”.
Rejecting “GCF_000981785.1_ASM98178v1”.
Any ideas on why it's rejecting these directories? Thanks for your help.

Not exactly sure why it's rejecting your request, but when I was still doing this type of thing I found that if I didn't download queries in smaller batches, the NCBI server timed me out and blocked my IP for a while before I could download again. That doesn't seem to be the same problem you're seeing, but maybe this script will get the same thing done. Let me know if this helps.
#!/usr/bin/env python
from Bio import Entrez

search_term = raw_input("Organism name: ")

Entrez.email = "your_email@isp.com"  # required by NCBI

search_handle = Entrez.esearch(db="nucleotide", term=search_term, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

gi_list = search_results["IdList"]
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

batch_size = 5  # download sequences in batches so NCBI doesn't time you out

with open("ALL_SEQ.fasta", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        print "Going to download record %i to %i" % (start + 1, end)
        fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)

print "\nDownload completed"

I found a Perl script that gets me close to accomplishing this task from here. Unfortunately, the script is just returning the IDs of the genomes, and not the actual sequences.
For example, the head of my output is:
gi|425458296|ref|NZ_CAIN00000000.1|NZ_CAIN01000000 Microcystis aeruginosa PCC 9808, whole genome shotgun sequencing project
gi|425448636|ref|NZ_CAIK00000000.1|NZ_CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project
Any perl users know what's going on?
use strict;
use LWP::Simple;

my ($name, $outname, $url, $xml, $out, $count, $query_key, $webenv, $ids);
my @genomeId;
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $limit = 'wgs[prop]+AND+srcdb+refseq[prop])';
my @species = ('Microcystis aeruginosa');

foreach my $s (@species) {
    undef @genomeId;
    $query_key = $webenv = '';
    $s =~ s/ /+/g;

    # ESearch
    $url = $base . "esearch.fcgi?db=genome&term=$s";
    $xml = get($url);
    $count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
    if ($count > 30) {
        $url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
        $xml = get($url);
    }
    while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
        push(@genomeId, $1);
    }
    $ids = join(',', @genomeId);

    # ELink
    $url = $base . "elink.fcgi?dbfrom=genome&db=nuccore&cmd=neighbor_history&id=$ids&term=$limit";
    $xml = get($url);
    $query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $webenv = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);

    # EFetch
    $url = $base . "efetch.fcgi?db=nuccore&query_key=$query_key&WebEnv=$webenv&rettype=fasta&retmode=text";
    $out = get($url);
    open (OUT, ">$s.fna");
    close OUT;
}

Related

PowerShell, download CSV into string

In PowerShell, instead of downloading a CSV file to disk from a web request, can you download the file directly into a string? I often pull earthquake data from the USGS database, which outputs the data as a downloadable CSV file. The issue is that the data is pulled per day, and when pulling years of data I end up pulling a thousand or more files and then piecing them together. If I could download directly into a string, I could eliminate the file read/write time needed for a thousand or more files by doing it all in memory.
The basic portion of code used to pull the file is:
$U = $env:userprofile
$Location = "Region_Earthquakes"
$MaxLat = "72.427"
$MinLat = "50.244"
$MaxLon = "-140.625"
$MinLon = "-176.133"
$yearspulled = 1
$ts = New-TimeSpan -Days 1
$Today = Get-Date
$StartDate = [datetime](((Get-Date -format "yyyy") - $yearspulled + 1).ToString() + "-01-01 00:00:00")
While ($StartDate -le $Today)
{
    $EndDate = $StartDate + $ts
    $Start = $StartDate.ToString("yyyy-MM-dd HH:mm:ss")
    $FileDate = $StartDate.ToString("yyyy_MM_dd")
    $End = $EndDate.ToString("yyyy-MM-dd HH:mm:ss")
    $url = "http://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=$Start&endtime=$End&minmagnitude=-1.00&maxmagnitude=10.00&maxlatitude=$MaxLat&minlatitude=$MinLat&maxlongitude=$MaxLon&minlongitude=$MinLon&eventtype=earthquake&orderby=time"
    $output = "$U\DownLoads\$Location" + "_$FileDate.csv"
    (New-Object System.Net.WebClient).DownloadFile($url, $output)
    Write-Output "Date pulled for $Start"
    $StartDate = $StartDate + $ts
}
When in doubt, read the documentation. The System.Net.WebClient class has a method DownloadString() that does exactly what you're asking.
WebClient.DownloadString Method
Namespace: System.Net
Assemblies: System.dll, netstandard.dll, System.Net.WebClient.dll, System.Net.dll
Downloads the requested resource as a String. The resource to download may be specified as either String containing the URI or a Uri.
Emphasis mine.
$s = (New-Object System.Net.WebClient).DownloadString($url)

Need simple parallelism example in Perl 6

I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total sum of lines of code among all of them. So far, I have this:
use v6;

my $start_time = now;
my $exception;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

for @files -> $file {
    $total_lines += $file.lines.elems;
    CATCH {
        default { $exception = $_; }   # some of the files error out for malformed UTF-8
    }
}
say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6 parallelism ideas? I realize the time saved won't be much, but it will serve as a proof of concept.
Implementing Christoph's suggestion. The count is slightly higher than in my original post because I'm now able to read the malformed UTF-8 files using the latin1 encoding.
use v6;

my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] @files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
use v6;

my $start_time = now;
my $exception;
my @lines;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

await do for @files -> $file {
    start {
        @lines.push( $file.lines.elems );
        CATCH {
            default { $exception = $_; }   # some of the files error out for malformed UTF-8
        }
    }
}
$total_lines = [+] @lines;
say $total_lines;
say now - $start_time;

Windows - Using Perl to monitor a directory for a new file drop/creation

Looking for a way to monitor a directory for a new file creation or drop.
So if I have a folder c:\temp and an abc.txt is copied or created in it, I want an event or something so that I can pick up that file and then process it.
Also, I want continuous monitoring of this folder. How can I do that? I am writing a service which does all this; I want to incorporate monitoring and processing in one script.
Thanks in advance.
The answer is here: In Perl, how can I watch a directory for changes?
For Linux:
use File::ChangeNotify;

my $watcher = File::ChangeNotify->instantiate_watcher(
    directories => [ 'archive/oswiostat' ],
    filter      => qr/\Aoracleapps[.].*dat\z/,
);

while (my @events = $watcher->wait_for_events) {
    # ...
}
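For reference, the # ... body would typically inspect each event. A minimal sketch of that loop body (assuming the standard File::ChangeNotify::Event accessors, path and type) could be:
while ( my @events = $watcher->wait_for_events ) {
    for my $event (@events) {
        next unless $event->type eq 'create';    # only react to newly created files
        print "New file: ", $event->path, "\n";
        # hand $event->path to your own processing code here
    }
}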
I think you are using Windows, so you have to use Win32::ChangeNotify.
Example from: http://www.perlmonks.org/?node_id=306175
use strict;
use Win32::ChangeNotify;

our $PATH ||= '.';
our $S = defined $S ? 1 : 0;

my $notify = Win32::ChangeNotify->new( $PATH, $S, 'FILE_NAME' );

my %last;
@last{ glob $PATH . '/*' } = ();

while( 1 ) {
    print('Nothing changed'), next
        unless $notify->wait( 10_000 );    # Check every 10 seconds
    $notify->reset;
    print 'Something changed';

    my @files = glob $PATH . '/*';

    if( @files < scalar keys %last ) {
        delete @last{ @files };
        print 'These files were deleted: ';
        print for keys %last;
    }
    elsif( @files > scalar keys %last ) {
        my %temp;
        @temp{ @files } = ();
        delete @temp{ keys %last };
        print 'These files were created: ';
        print for keys %temp;
    }
    else {
        print "A non-deletion or creation change occurred";
    }

    undef %last;
    @last{ @files } = ();
}
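If the end goal is simply to pick up each new file and process it, a plain polling loop along the same glob-diff lines can also do the job. This is only a sketch, and process_file() is a hypothetical placeholder for your own processing code:
#!/usr/bin/perl
use strict;
use warnings;

my $dir = 'c:/temp';

# remember what is already in the directory
my %seen;
@seen{ glob "$dir/*" } = ();

while (1) {
    sleep 10;                              # poll every 10 seconds
    my @now = glob "$dir/*";
    for my $file (@now) {
        next if exists $seen{$file};
        print "New file: $file\n";
        process_file($file);               # hypothetical: your processing goes here
    }
    %seen = ();
    @seen{@now} = ();
}

sub process_file {
    my ($file) = @_;
    # placeholder: open, parse, move the file, etc.
}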

Scraping multiple items off a page into a neat row

As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I run some code, which I won't post here for brevity, and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page. In the code below, I only do one.
I'd also like each item to be separated by a comma in the output .txt, and I'd like the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = new HTML::DOM;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!
This is fairly simple to do using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have this web site's permission to extract data this way, but I should draw your attention to the X-Copyright header in its HTTP responses. This approach clearly falls under the heading of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
Am I to expect an email from you? I replied to your first mail but haven't heard since.
use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);

my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');

my %data;
for my $item (@results) {
    my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
    $key =~ s/:$//;
    $data{$key} = $val;
}

my $record = join ',',
    map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
    'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
    'event place', 'marital status', 'residence in 1935',
    'relationship to head of household (standardized)';

print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new( );
$te->parse($content);
$table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n"; exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
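If you want the whole record rather than one cell, the same HTML::TableExtract object can be walked row by row. A rough sketch (untested against the live page, and quoting values that contain commas or spaces as in the answer above):
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');
die $res->status_line unless $res->is_success;

my $te = HTML::TableExtract->new( attribs => { class => 'result-data' } );
$te->parse($res->decoded_content);

my @values;
for my $row ( $te->first_table_found->rows ) {
    my ( $label, $value ) = map { defined $_ ? $_ : '' } @$row;
    $value =~ s/^\s+|\s+$//g;                      # trim whitespace
    $value = qq{"$value"} if $value =~ /[,\s]/;    # quote fields with commas or spaces
    push @values, $value;
}
print join( ',', @values ), "\n";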
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS
I thought I had answered your question.
The problem is that you are trying to fetch the web page with LWP. Why are you trying to do that if you already have WWW::Mechanize::Firefox?
Did you try this?
It will retrieve and save each link for further analysis. A small change and you 'get' the DOM tree. Sorry, I do not have access to this page, so I just hope it will work.
my $i = 1;
for my $link (@links) {
    print Dumper $link->url;
    print Dumper $link->text;

    my $tempfile = "./$i.html";    # double quotes so $i interpolates
    $i++;

    $mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );

    my $dom_tree = $mech->document();
    my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with a regexp. (Everyone: please remember, there is always more than one way to do something with Perl! It works, and it is easy...)
I tried it out with this command:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html; cat 1.html | perl 1.pl
use strict;
use warnings;
use Data::Dumper;

local $/ = undef;
my $html = <>;    # read from file
#$html = $mech->content( format => 'html' );    # or read the data from the mech object

my $data = {};
my $current_label = "not_defined";

while ($html =~ s!(<td[^>]*>.*?</td>)!!is){    # process each TD
    my $td = $1;
    print "td: $td\n";
    my $td_val = $td;
    $td_val =~ s!<[^>]*>!!gis;
    $td_val =~ s!\s+! !gs;
    $td_val =~ s!(\A\s+|\s+\z)!!gs;
    if ($td =~ m!result-label!){               # primitive state machine: store the current label
        print "current_label: $current_label\n";
        $current_label = $td_val;
    } elsif ($td =~ m!result-value!){          # add each value under the current label
        push(@{$data->{$current_label}}, $td_val);
    } else {
        warn "found something else: $td\n";
    }
}

# Process the data using a white list of known entries (son, race, etc.).
# Delete each entry from the result once handled; die if something new shows up.

# multi-valued types
foreach my $type (qw(son wife daughter head)){
    process_multi($type, $data->{$type});
    delete($data->{$type});
}

# simple types
foreach my $type (qw(birthplace age)){
    process_simple($type, $data->{$type});
    delete($data->{$type});
}

die "Unknown label!" . Dumper($data) if scalar(keys %{$data}) > 0;
Output:
'line number:' => [
'28'
],
'estimated birth year:' => [
'1912'
],
'head' => [
'Benjamin Schuvlein',
'M',
'28',
'Germany'
],
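One note on the regexp version above: it calls process_multi() and process_simple() without defining them. Hypothetical stubs along these lines would make the sketch run end to end:
# Hypothetical stubs for the routines referenced above: just print each
# label together with whatever values were collected for it.
sub process_multi {
    my ( $type, $values ) = @_;
    return unless $values && @$values;
    print "$type: ", join( ', ', @$values ), "\n";
}

sub process_simple {
    my ( $type, $values ) = @_;
    return unless $values && @$values;
    print "$type: $values->[0]\n";
}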

Perl Out Of Memory

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;

my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";

open(OPT, "<$OPT");
while(<OPT>){
    my @OPT_Line = split(/,/, $_);
    $beg_docs{@OPT_Line[0]} = "Y" if(@OPT_Line[3] eq "Y");
    $opt_beg_docs{@OPT_Line[0]} = "Y";
}
close(OPT);

open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
    print OUT "$key\n";
}
close(OUT);

open(DAT, "<$DAT");
readline(DAT);    # skips header line
while(<DAT>){
    $_ =~ s/\xFE//g;
    my @DAT_Line = split(/\x14/, $_);
    # gets the prefix and the range of the beg and end docs
    (my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
    (my $beg = @DAT_Line[$beg_doc]) =~ s/\D//g;
    (my $end = @DAT_Line[$end_doc]) =~ s/\D//g;
    #print OUT "BEGDOC: $beg ENDDOC: $end\n";
    foreach($beg .. $end){
        my $doc_id = $pre . $_;
        if($opt_beg_docs{$doc_id} ne "Y"){
            if($beg_docs{$doc_id} ne "Y"){
                print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
                $doc_counter++;
            } else {
                print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
                $page_counter++;
            }
        }
    }
}
close(DAT);
close(OUT);

print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically, I get all the IDs from the file I am checking against. Then I loop over the other file and generate its IDs, because they are presented as a range. Then I take each generated ID and look it up in the hash of IDs.
Also, I forgot to note that I am using Windows.
You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.
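As a minimal sketch of that advice (the file-name handling here is illustrative, not your exact script): enable warnings, use a three-argument open that reports why it failed, and print what you are actually reading while you debug:
use strict;
use warnings;                          # surfaces undefined values and similar mistakes early

my $OPT = $ARGV[0] or die "usage: $0 OPT_FILE\n";

# three-argument open with a lexical filehandle; die with the OS error on failure
open my $opt_fh, '<', $OPT or die "Can't open $OPT: $!";

while ( my $line = <$opt_fh> ) {
    chomp $line;
    warn "OPT line: $line\n";          # debugging: see exactly what is being parsed
    my @fields = split /,/, $line;
    # ... build %opt_beg_docs / %beg_docs from @fields as in the original script
}

close $opt_fh or warn "close failed: $!";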
I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.
You're closing your OUT filehandle and then trying to print to it inside the DAT loop, which I think might be outputting to random memory since you closed the filehandle; I'm surprised this didn't produce an error.
Remove the first close(OUT); and see if that improves things.
I still don't know what your question is. If it's about the error message, it means you've run out of memory: you're trying to consume more memory than you have. If it's about why you're consuming too much memory, I'd first ask whether you read my message above, then how much memory your system has, and then whether it improves if you take the regex away.
