Windows - Using Perl to monitor a directory for a new file drop/creation

Looking for a way to monitor a directory for a new file creation or a drop.
So if I have a folder c:\temp, and an abc.txt is copied or created in it, I want an event or something similar so that I can pick up that file and process it.
I also want continuous monitoring of this folder. How can I do that? I am writing a service that does all of this, and I want to incorporate monitoring and processing in one script.
Thanks in advance.

The answer is here: In Perl, how can I watch a directory for changes?
For Linux:
use File::ChangeNotify;

my $watcher = File::ChangeNotify->instantiate_watcher(
    directories => [ 'archive/oswiostat' ],
    filter      => qr/\Aoracleapps[.].*dat\z/,
);

while ( my @events = $watcher->wait_for_events ) {
    # ...
}
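Since the goal is to pick up each new file and process it, the event loop can dispatch on the event type. A minimal sketch, assuming a hypothetical process_file() handler (File::ChangeNotify events expose path and type methods, with type being 'create', 'modify', or 'delete'):

while ( my @events = $watcher->wait_for_events ) {
    for my $event (@events) {
        process_file( $event->path ) if $event->type eq 'create';
    }
}

# hypothetical handler; replace with your real processing logic
sub process_file {
    my ($path) = @_;
    print "Processing new file: $path\n";
}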
I think you are on Windows, so you will have to use Win32::ChangeNotify instead.
Example from http://www.perlmonks.org/?node_id=306175:
use strict;
use Win32::ChangeNotify;

our $PATH ||= '.';
our $S = defined $S ? 1 : 0;

my $notify = Win32::ChangeNotify->new( $PATH, $S, 'FILE_NAME' );

my %last;
@last{ glob $PATH . '/*' } = ();

while ( 1 ) {
    print('Nothing changed'), next
        unless $notify->wait( 10_000 ); # Check every 10 seconds

    $notify->reset;
    print 'Something changed';

    my @files = glob $PATH . '/*';

    if ( @files < scalar keys %last ) {
        delete @last{ @files };
        print 'These files were deleted: ';
        print for keys %last;
    }
    elsif ( @files > scalar keys %last ) {
        my %temp;
        @temp{ @files } = ();
        delete @temp{ keys %last };
        print 'These files were created: ';
        print for keys %temp;
    }
    else {
        print "A non-deletion or creation change occurred";
    }

    undef %last;
    @last{ @files } = ();
}
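To combine monitoring and processing in one script, as the question asks, the 'created' branch above is the natural place to hook in. A minimal sketch, again assuming a hypothetical process_file() routine:

    elsif ( @files > scalar keys %last ) {
        my %temp;
        @temp{ @files } = ();
        delete @temp{ keys %last };
        process_file($_) for keys %temp;  # hand each newly created file to your own logic
    }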

Prepend folder name to paths in an array

I'm looking to prepend a folder name to the start of an array of (relative) paths using a foreach statement, but it's not making any changes to the array (and there are no errors either).
Note: this is more for educational purposes than functional, as I have it working with a for loop (commented out below), but I'm interested in learning how the foreach statement works.
$myFiles = @(
    "blah1\blah2\file1.txt"
    "blah3\blah4\file2.txt"
    "blah5\blah6\file3.txt"
)
$checkoutFolder = "folder1"

#for ($h = 0; $h -lt $myFiles.Length; $h++) {
#    $myFiles[$h] = $checkoutFolder + "\" + $myFiles[$h]
#}

foreach ($path in $myFiles) {
    $path = $checkoutFolder + "\" + $path
}
$myFiles
I also tried using a buffer variable, e.g.
$buffer = $checkoutFolder + "\" + $path
$path = $buffer
but got the same result, i.e.:
OUTPUT:
blah1\blah2\file1.txt
blah3\blah4\file2.txt
blah5\blah6\file3.txt
In a foreach loop, $path holds a copy of the current element, so assigning to it rebinds the loop variable rather than modifying the array. I can think of two ways around that:
Create new array with modified data of old array
$myFiles = @(
    "blah1\blah2\file1.txt"
    "blah3\blah4\file2.txt"
    "blah5\blah6\file3.txt"
)
$checkoutFolder = "folder1"

# Create new array $myFilesnew
$myFilesnew = @()

# For each line in the old array
foreach ($file in $myFiles)
{
    # Create a new row from the modified row $file of the $myFiles array
    $row = $checkoutFolder + "\" + $file
    # Add row $row to the new array $myFilesnew
    $myFilesnew += $row
}
$myFilesnew
Modify each row of existing array:
$myFiles = @(
    "blah1\blah2\file1.txt"
    "blah3\blah4\file2.txt"
    "blah5\blah6\file3.txt"
)
$checkoutFolder = "folder1"

$i = 0
while ($i -lt $myFiles.Count)
{
    # Get row $myFiles[$i] from the array, insert the prefix, and write the result back
    $myFiles[$i] = $myFiles[$i].Insert(0, $checkoutFolder + "\")
    # Add 1 to $i
    $i++
}
$myFiles
Better to start using the Join-Path cmdlet, which avoids creating paths with backslashes omitted or doubled.
Something like this would do it:
$checkoutFolder = "folder1"
$myFiles = "blah1\blah2\file1.txt", "blah3\blah4\file2.txt", "blah5\blah6\file3.txt" | ForEach-Object {
    Join-Path -Path $checkoutFolder -ChildPath $_
}
$myFiles
output:
folder1\blah1\blah2\file1.txt
folder1\blah3\blah4\file2.txt
folder1\blah5\blah6\file3.txt
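Join-Path also normalizes the separator, so it stays safe even when the folder variable already ends in a backslash. A quick illustration:

Join-Path -Path "folder1\" -ChildPath "blah1\blah2\file1.txt"
# folder1\blah1\blah2\file1.txt  (no doubled backslash)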
You can also use -replace to substitute the folder name at the regex beginning (^) of each string. This is a useful idiom for generating computer names too.
'blah1\blah2\file1.txt','blah3\blah4\file2.txt','blah5\blah6\file3.txt' -replace '^','folder1\'
folder1\blah1\blah2\file1.txt
folder1\blah3\blah4\file2.txt
folder1\blah5\blah6\file3.txt

Adding a new position at the end of the file in Shell or Perl

My question is how to add new fields at the end of each line of a file, in Shell or Perl.
I have two files:
File A, with 536382 lines, where the key is the third column:
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
File B, with 946 lines, where the key is the first column:
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
I want a combination of these two files:
abc1111,1070X00Y0,9999,B,West,Morning
abc2222,1070X00Y0,9999,B,West,Morning
abc3333,1070x00Y0,9999,B,West,Morning
I was trying a shell script, but I was receiving an out-of-memory message.
So I am open to suggestions.
Thank you.
I was able to get the results you want by making a few changes to your code.
#!/usr/bin/perl
use strict;
use warnings;

open IN2, '<', \<<EOF;
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
EOF

my %hash;
while ( <IN2> ) {
    chomp;
    my @col2 = split ",";
    $hash{$col2[0]}{$col2[1]} = $col2[2];
}

open IN1, '<', \<<EOF;
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
EOF

while ( <IN1> ) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists $hash{$key} ) {
        print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
This produced output of:
abc1111,1070X00Y0,**9999**,B,West,Morning
abc2222,1070X00Y0,**9999**,B,West,Morning
abc3333,1070x00Y0,**9999**,B,West,Morning
The changes to the code were:
$hash{$col2[0]}{$col2[1]} = $col2[2]; builds a hash of hashes to hold the Position and Time keys. They are used in a hash slice here:
@{ $hash{$key} }{ qw/Position Time/ }
The overall approach is to:
Convert the small file into a Perl hash
Process the big file line by line
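The demo reads its data from inline heredocs so it can run stand-alone; with the real files the same approach looks like this (a sketch, assuming hypothetical file names fileB.csv and fileA.csv):

#!/usr/bin/perl
use strict;
use warnings;

# load the small file (B) into a hash of hashes keyed by its first column
my %hash;
open my $in2, '<', 'fileB.csv' or die "Can't open fileB.csv: $!";
while ( <$in2> ) {
    chomp;
    my ($key, $label, $value) = split /,/;
    $hash{$key}{$label} = $value;
}
close $in2;

# stream the big file (A) line by line, appending Position and Time
open my $in1, '<', 'fileA.csv' or die "Can't open fileA.csv: $!";
while ( <$in1> ) {
    chomp;
    my $key = (split /,/)[2];
    print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n"
        if exists $hash{$key};
}
close $in1;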

Downloading multiple fasta files from ncbi

I'm trying to download all fasta files associated with one organism from ncbi.
I tried wget -r -l3 -A "*.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/ to get all files ending in .fna.gz from the third level down, but it just rejects everything with the following output:
Removed “ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/latest_assembly_versions/.listing”.
Rejecting “GCF_000010625.1_ASM1062v1”.
Rejecting “GCF_000307995.1_ASM30799v2”.
Rejecting “GCF_000312165.1_ASM31216v1”.
Rejecting “GCF_000312185.1_ASM31218v1”.
Rejecting “GCF_000312205.1_ASM31220v1”.
Rejecting “GCF_000312225.1_ASM31222v1”.
Rejecting “GCF_000312245.1_ASM31224v1”.
Rejecting “GCF_000312265.1_ASM31226v1”.
Rejecting “GCF_000312285.1_ASM31228v1”.
Rejecting “GCF_000312725.1_ASM31272v1”.
Rejecting “GCF_000330925.1_MicAerT1.0”.
Rejecting “GCF_000332585.1_MicAerD1.0”.
Rejecting “GCF_000412595.1_spc777-v1”.
Rejecting “GCF_000599945.1_Mic70051.0”.
Rejecting “GCF_000787675.1_ASM78767v1”.
Rejecting “GCF_000981785.1_ASM98178v1”.
Any ideas on why it's rejecting these directories? Thanks for your help.
I'm not exactly sure why it's rejecting your request, but when I was still doing this type of thing, I found that if I didn't download queries in smaller batches, the NCBI server would time me out and block my IP for a while before I could download again. That doesn't seem to be the same problem you're seeing, but maybe this script will get the same thing done. Let me know if this helps.
#!/usr/bin/env python
from Bio import Entrez

search_term = raw_input("Organism name: ")

Entrez.email = "your_email@isp.com"  # required by NCBI

search_handle = Entrez.esearch(db="nucleotide", term=search_term, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

gi_list = search_results["IdList"]
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

batch_size = 5  # download sequences in batches so NCBI doesn't time you out

with open("ALL_SEQ.fasta", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        print "Going to download record %i to %i" % (start + 1, end)
        fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)

print("\nDownload completed")
I found a Perl script that gets me close to accomplishing this task from here. Unfortunately, the script is just returning the IDs of the genomes, not the actual sequences.
For example, the head of my output is:
gi|425458296|ref|NZ_CAIN00000000.1|NZ_CAIN01000000 Microcystis aeruginosa PCC 9808, whole genome shotgun sequencing project
gi|425448636|ref|NZ_CAIK00000000.1|NZ_CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project
Any Perl users know what's going on?
use strict;
use LWP::Simple;

my ($name, $outname, $url, $xml, $out, $count, $query_key, $webenv, $ids);
my @genomeId;
my $base  = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $limit = 'wgs[prop]+AND+srcdb+refseq[prop]';
my @species = ('Microcystis aeruginosa');

foreach my $s (@species) {
    undef @genomeId;
    $query_key = $webenv = '';
    $s =~ s/ /+/g;

    # ESearch
    $url = $base . "esearch.fcgi?db=genome&term=$s";
    $xml = get($url);
    $count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
    if ($count > 30) {
        $url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
        $xml = get($url);
    }
    while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
        push(@genomeId, $1);
    }
    $ids = join(',', @genomeId);

    # ELink
    $url = $base . "elink.fcgi?dbfrom=genome&db=nuccore&cmd=neighbor_history&id=$ids&term=$limit";
    $xml = get($url);
    $query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $webenv = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);

    # EFetch
    $url = $base . "efetch.fcgi?db=nuccore&query_key=$query_key&WebEnv=$webenv&rettype=fasta&retmode=text";
    $out = get($url);

    open (OUT, ">$s.fna");
    print OUT $out;  # write the fetched FASTA data to the per-species output file
    close OUT;
}

Need simple parallelism example in Perl 6

I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total sum of lines of code among all of them. So far, I have this:
use v6;

my $start_time = now;
my $exception;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

for @files -> $file {
    $total_lines += $file.lines.elems;
    CATCH {
        default { $exception = $_; } # some of the files error out as malformed UTF-8
    }
}

say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6's parallelism features? I realize the time saved won't be much, but it will work as a proof of concept.
Update: implementing Christoph's suggestion below. The count is slightly higher than in my original post because I'm now able to read the malformed UTF-8 files by decoding them as latin1.
use v6;

my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] @files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
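By default .race chooses its own batch size and degree of parallelism; both can be tuned, which is worth experimenting with for I/O-bound work like this. A sketch under the same assumptions (the batch and degree values are illustrative, not recommendations):

my $total_lines = [+] @files.race(batch => 1, degree => 4)
                            .map(*.lines(:enc<latin1>).elems);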
use v6;

my $start_time = now;
my $exception;
my @lines;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

await do for @files -> $file {
    start {
        @lines.push( $file.lines.elems );
        CATCH {
            default { $exception = $_; } # some of the files error out as malformed UTF-8
        }
    }
}

$total_lines = [+] @lines;
say $total_lines;
say now - $start_time;
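Note that pushing to @lines from many start blocks at once is a potential data race. Since await returns the results of the promises it waits on, a variant that avoids shared state entirely might look like this (a sketch; try ... // 0 simply counts unreadable files as zero lines):

use v6;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);
my @promises = @files.map: -> $file {
    start { (try $file.lines.elems) // 0 } # malformed files count as 0 lines
};
my $total_lines = [+] await @promises;
say $total_lines;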

Scraping multiple items off of a Page into a Neat Row

As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I do some code that I will not post here for brevity and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page; in the code below, I only scrape one.
I'd also like each item to be separated by a , in the output .txt.
And I'd like the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = new HTML::DOM;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!
This is fairly simply done using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the header of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);

my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');

my %data;
for my $item (@results) {
    my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
    $key =~ s/:$//;
    $data{$key} = $val;
}

my $record = join ',',
    map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
    'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
    'event place', 'marital status', 'residence in 1935',
    'relationship to head of household (standardized)';

print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;

$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;

$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";

$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new();
$te->parse($content);
$table = $te->first_table_found;

# print $content; exit;
# $te->tables_dump(1);
# print Dumper($te);
# print Dumper($table);

print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n";
exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS
I thought I had answered your question.
The problem is that you are trying to fetch the web page with LWP. Why are you doing that when you already have WWW::Mechanize::Firefox?
Have you tried this?
It will retrieve and save each link for further analysis. With a small change you also 'get' the DOM tree. Sorry, I do not have access to this page, so I just hope it will work.
my $i = 1;
for my $link (@links) {
    print Dumper $link->url;
    print Dumper $link->text;

    my $tempfile = "./$i.html"; # double quotes so $i interpolates
    $i++;

    $mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
    my $dom_tree = $mech->document();
    my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with a regexp. (Everyone: please remember, there is always more than one way to do something with Perl! It works, and it is easy...)
I tried it out with this command:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html && cat 1.html | perl 1.pl
use strict;
use warnings;
use Data::Dumper;

local $/ = undef;
my $html = <>; # read from file
#$html = $mech->content( format => 'html' ); # or read data from the mech object

my $data = {};
my $current_label = "not_defined";

while ($html =~ s!(<td[^>]*>.*?</td>)!!is) { # process each TD
    my $td = $1;
    print "td: $td\n";

    my $td_val = $td;
    $td_val =~ s!<[^>]*>!!gis;       # strip tags
    $td_val =~ s!\s+! !gs;           # collapse whitespace
    $td_val =~ s!(\A\s+|\s+\z)!!gs;  # trim

    if ($td =~ m!result-label!) {    # primitive state machine: store the current label
        print "current_label: $current_label\n";
        $current_label = $td_val;
    }
    elsif ($td =~ m!result-value!) { # add each value under the current label
        push(@{ $data->{$current_label} }, $td_val);
    }
    else {
        warn "found something else: $td\n";
    }
}

# Process the result using white lists of known entries (son, race, etc.).
# Delete each entry from the result once handled; die if something new shows up.

# multi-valued types
foreach my $type (qw(son wife daughter head)) {
    process_multi($type, $data->{$type});
    delete($data->{$type});
}

# simple types
foreach my $type (qw(birthplace age)) {
    process_simple($type, $data->{$type});
    delete($data->{$type});
}

die "Unknown label!" . Dumper($data) if scalar(keys %{$data}) > 0;
Output:
'line number:' => [
    '28'
],
'estimated birth year:' => [
    '1912'
],
'head' => [
    'Benjamin Schuvlein',
    'M',
    '28',
    'Germany'
],
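The script above calls process_multi and process_simple but never defines them; minimal hypothetical stubs, just so it runs end to end, might look like this (replace with real handling of each record type):

# hypothetical stubs, not part of the original answer
sub process_multi {
    my ($type, $values) = @_;
    return unless defined $values;
    print "$type: " . join(', ', @$values) . "\n";
}

sub process_simple {
    my ($type, $values) = @_;
    return unless defined $values;
    print "$type: $values->[0]\n";
}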
