Scraping multiple items off of a Page into a Neat Row

Scraping multiple items off of a Page into a Neat Row - ajax

As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I do some code that I will not post here for brevity and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page. In the code below, I only do 1.
I'd also like to make each item be separated by a , in the output .txt.
And, I'd like the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = new HTML::DOM;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!

This is fairly simply done using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the header of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
Am I to expect an email from you? I replied to your first mail but haven't heard since.
use strict;
use warnings;
use URI;
use LWP;
use HTML::TreeBuilder::XPath;
my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my #results = $tree->findnodes('//table[#class="result-data"]//tr[#class="result-item"]');
my %data;
for my $item (#results) {
my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
$key =~ s/:$//;
$data{$key} = $val;
}
my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
'event place', 'marital status', 'residence in 1935',
'relationship to head of household (standardized)';
print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head

Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new( );
$te->parse($content);
$table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n"; exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS

I thought I had answered your question.
The problem is that you are trying to fetch the webpage with LWP. Why are try to doing that if you already have WWW::Mechanize::Firefox?
Did you tried this?
It will retrieve and save each link for further analyses. A small change and you 'get' the DOM tree. Sorry, I do not have acccess to this page, so I just hope it will work.
my $i=1;
for my $link (#links) {
print Dumper $link->url;
print Dumper $link->text;
my $tempfile = './$i.html';$i++;
$mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
my $dom_tree = $mech->document();
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with regexp (Everyone: Please remember, there is always more than one way to do something wwith Perl!. It works, it is easy...)
it tried it out with this cmd:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html|cat 1.html|1.pl
use Data::Dumper;
use strict;
use warnings;
local $/=undef;
my $html = <>;#read from file
#$html = $mech->content( format => 'html' );# read data from mech object
my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is){ # process each TD
my $td = $1;
print "td: $td\n";
my $td_val = $td;
$td_val =~ s!<[^>]*>!!gis;
$td_val =~ s!\s+! !gs;
$td_val =~ s!(\A\s+|\s+\z)!!gs;
if ($td =~ m!result-label!){ #primitive state machine, store the current label
print "current_label: $current_label\n";
$current_label = $td_val;
} elsif ($td =~ m!result-value!){ #add each data to current label
push(#{$data->{$current_label}},$td_val);
} else {
warn "found something else: $td\n";
}
}
#process it using a white lists of known entries (son,race, etc).Delete from the result if you find it on white list, die if you find something new.
#multi type
foreach my $type (qw(son wife daughter head)){
process_multi($type,$data->{$type});
delete($data->{$type});
}
#simple type
foreach my $type (qw(birthplace age)){
process_simple($type,$data->{$type});
delete($data->{$type});
}
die "Unknown label!".Dumper($data) if scalar(keys %{$data})>0;
Output:
'line number:' => [
'28'
],
'estimated birth year:' => [
'1912'
],
'head' => [
'Benjamin Schuvlein',
'M',
'28',
'Germany'
],

Related

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis and it can take about ~10 minutes to read from a file and populate the hash. ~20 minutes spent to read both files and to populate hash.
comparison process takes about ~30-50 minutes including iterating over hash, collecting desired result(s), and writing to output file (csv,psv).
All in all it can take anywhere between 30 minutes to 60 minutes on a 32 core dual xeon cpu server with 256gb of RAM, including intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single threaded approach using perl5.
fetch data from 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. Source of data could be either a flat csv or psv file OR a database query Array of Array result, via DBI client. Data is always unsorted to start with.
To be specific, I read the file line by line,split fields, and choose desired indexes for key,value pair and insert into hash.
After collecting data and populating hash with desired key/value pairs,I start to compare and collect results (mainy comparing on what is missing or different in s2 w.r.t s1 and vice-versa).
dump output in an excel file (very costly if no. of lines is large like ~1 million or greater) or in a simple CSV (cheap operation. preferred method).
I was wondering whether if I could somehow do the first step in parallel i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump output?
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using perl6 but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well but sadly I don't have much to show for what have I tried till now? and reason is that I am just beginning to tackle this one. I am just unable to see past the single threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed by the community as 'too broad' - allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What type of new techniques are available or applicable with perl6 in order to tackle this problem or type of problem.
If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current p5 comparison logic is shown below for your reference. Hope this helps and not let this question shutdown.
package COMP;
use strict;
use Data::Dumper;
sub comp
{
my ($data,$src,$tgt) = #_;
my $result = {};
my $ms = ($result->{ms} = {});
my $mt = ($result->{mt} = {});
my $diff = ($result->{diff} = {});
foreach my $key (keys %{$data->{$src}})
{
my $src_val = $data->{$src}{$key};
my $tgt_val = $data->{$tgt}{$key};
next if ($src_val eq $tgt_val);
if (!exists $data->{$tgt}{$key}) {
push (#{$mt->{$key}}, "$src_val|NULL");
}
if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
push (#{$diff->{$key}}, "$src_val|$tgt_val")
}
}
foreach my $key (keys %{$data->{$tgt}})
{
my $src_val = $data->{$src}{$key};
my $tgt_val = $data->{$tgt}{$key};
next if ($src_val eq $tgt_val);
if (!exists $data->{$src}{$key}) {
push (#{$ms->{$key}},"NULL|$tgt_val");
}
}
return $result;
}
1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User#Host:]$ perl testCOMP.pl
$VAR1 = {
'mt' => {
'Source' => [
'source|NULL'
]
},
'ms' => {
'Target' => [
'NULL|target'
]
},
'diff' => {
'Sunday_isit' => [
'Yes|No'
]
}
};
Test Script
[User#Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;
my $data2 = {
f1 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'Yes',
Source => 'source',
},
f2 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'No',
Target => 'target',
},
};
my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User#Host:]$

If you have an existing and working toolchain you don't have to rewrite it all to use Perl6. It's parallelism mechanisms work fine with external processess too. Consider
allnum.pl6
use v6;
my #processes =
[ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
.map( -> $filename {
[ $filename, run "perl", "num.pl", $filename, :out ];
})
.hyper;
say "Lazyness Here!";
my $time = time;
for #processes
{
say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;
my $file = shift #ARGV;
my $start = time;
my $result = 0;
open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
chomp $thing;
$thing =~ s/ //g;
$result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num5.txt

Downloading multiple fasta files from ncbi

I'm trying to download all fasta files associated with one organism from ncbi.
I tried wget -r -l3 -A "*.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/ to get all files ending in .fna.gz from the third level down, but then it just rejects everything with the following output:
Removed “ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/latest_assembly_versions/.listing”.
Rejecting “GCF_000010625.1_ASM1062v1”.
Rejecting “GCF_000307995.1_ASM30799v2”.
Rejecting “GCF_000312165.1_ASM31216v1”.
Rejecting “GCF_000312185.1_ASM31218v1”.
Rejecting “GCF_000312205.1_ASM31220v1”.
Rejecting “GCF_000312225.1_ASM31222v1”.
Rejecting “GCF_000312245.1_ASM31224v1”.
Rejecting “GCF_000312265.1_ASM31226v1”.
Rejecting “GCF_000312285.1_ASM31228v1”.
Rejecting “GCF_000312725.1_ASM31272v1”.
Rejecting “GCF_000330925.1_MicAerT1.0”.
Rejecting “GCF_000332585.1_MicAerD1.0”.
Rejecting “GCF_000412595.1_spc777-v1”.
Rejecting “GCF_000599945.1_Mic70051.0”.
Rejecting “GCF_000787675.1_ASM78767v1”.
Rejecting “GCF_000981785.1_ASM98178v1”.
Any ideas on why it's rejecting these directories? Thanks for your help.

Not exactly sure why it's rejecting your request, but when I was still doing this type of thing, I found that if I don't download queries in smaller batches, the NCBI server timed me out and blocked my IP for a while before I could download again. This doesn't seem to be the same problem that your seeing, but maybe this script might get the same thing done. Let me know if this helps.
#!/usr/bin/env python
from Bio import Entrez
search_term = raw_input("Organism name: ")
Entrez.email = "your_email#isp.com" # required by NCBI
search_handle = Entrez.esearch(db="nucleotide", term=search_term, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
batch_size = 5 # download sequences in batches so NCBI doesn't time you out
with open("ALL_SEQ.fasta", "w") as out_handle:
for start in range(0, count, batch_size):
end = min(count, start+batch_size)
print "Going to download record %i to %i" % (start+1, end)
fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key)
data = fetch_handle.read()
fetch_handle.close()
out_handle.write(data)
print ("\nDownload completed")

I found a perl script that gets me close to accomplishing this task from here . Unfortunately, this script is just returning the ID's of the genomes, and not the actual sequences.
For example, the head of my output is:
gi|425458296|ref|NZ_CAIN00000000.1|NZ_CAIN01000000 Microcystis aeruginosa PCC 9808, whole genome shotgun sequencing project
gi|425448636|ref|NZ_CAIK00000000.1|NZ_CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project
Any perl users know what's going on?
use strict;
use LWP::Simple;
my ($name, $outname, $url, $xml, $out, $count, $query_key, $webenv, $ids);
my #genomeId;
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $limit = 'wgs[prop]+AND+srcdb+refseq[prop])';
my #species = ('Microcystis aeruginosa');
foreach my $s (#species) {
undef #genomeId;
$query_key = $webenv = '';
$s =~ s/ /+/g;
# ESearch
$url = $base . "esearch.fcgi?db=genome&term=$s";
$xml = get($url);
$count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
if ($count > 30) {
$url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
$xml = get($url);
}
while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
push(#genomeId, $1);
}
$ids = join(',', #genomeId);
# ELink
$url = $base . "elink.fcgidbfrom=genome&db=nuccore&cmd=neighbor_history&id=$ids&term=$limit";
$xml = get($url);
$query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
$webenv = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);
# EFetch
$url = $base . "efetch.fcgidb=nuccore&query_key=$query_key&WebEnv=$webenv&rettype=fasta&retmode=text";
$out = get($url);
open (OUT, ">$s.fna");
close OUT;
}

Perl : ignore columns from csv file during insert

I'm inserting a CSV file into TAB,
Is there any way I can ignore 2nd and 3rd column from the CSV file during insert?
I tried changing $row[] values to $row[0],$row[3],$row[4],$row[5],$row[6] but this doesn't work. Tried few more things and also search inside SO but couldn't find what I'm looking for.
could somebody suggest or point me to link?
use warnings;
use strict;
use Text::CSV;
use DBD::Oracle;
my $exitStatus = 0;
dbConnect();
&insertRecords
#----------------
sub insertRecords {
my $csv;
my $fileToInsert = shift;
my $row;
my $SQL1;
my $sth;
my $rc;
open my $fh, "<", $fileToInsert or die "$filetoInsert: $!"
$SQL1 = "Insert into TAB1 (sample_date, server, first, n1, n2)
values (?,?,?,?,?)";
$sth = prepare($SQL1)
while ($row = $csv->getline ($fh)) {
$sth -> execute($row[0], $row[1], $row[2], $row[3], $row[4])
}
CSV File:
sample_date,
date_x1
x2
server
first
n1
n2

You need to first read the whole file ,then run the query.
Right now the query is run on every line. But you have each entry one a line - not per column as you write - so you might not need the CSV module? Anyway, there are several errors in there. I've simplified it a little.
try something like this:
use warnings;
use strict;
use DBD::Oracle;
my $exitStatus = 0;
dbConnect();
my $file = shift;
&insertRecords($file);
#----------------
sub insertRecords {
my $fileToInsert = shift;
my $fh;
open $fh, "<", $fileToInsert or die "$fileToInsert: $!";
my $SQL1 = "Insert into TAB1 (sample_date, server, first, n1, n2) values (?,?,?,?,?)";
my $sth = prepare($SQL1);
my #row = $fh->getlines;
print ($row[0], $row[3], $row[4], $row[5], $row[6]); # <-- always good for testing!
$sth -> execute($row[0], $row[3], $row[4], $row[5], $row[6]);
}

Codeigniter CSV upload then explode

I have some code that uploads the CSV file to the specified folder, but it doesn't update the database.
public function do_upload()
{
$csv_path = realpath(APPPATH . '/../assets/uploads/CSV/');
$config['upload_path'] = $csv_path;
$config['allowed_types'] = '*'; // All types of files allowed
$config['overwrite'] = true; // Overwrites the existing file
$this->upload->initialize($config);
$this->load->library('upload', $config);
if ( ! $this->upload->do_upload('userfile'))
{
$error = array('error' => $this->upload->display_errors());
$this->layout->buffer('content', 'program/upload', $error);
$this->layout->render();
}
else
{
$image_data = $this->upload->data();
$fname = $image_data['file_name'];
$fpath = $image_data['file_path'].$fname;
$fh = fopen($fpath, "r");
$insert_str = 'INSERT INTO wc_program (JobRef, Area, Parish, AbbrWorkType, WorkType, Timing, TrafficManagement, Location, Duration, Start, Finish) VALUES '."\n";
if ($fh) {
// Create each set of values.
while (($csv_row = fgetcsv($fh, 2000, ',')) !== false) {
foreach ($csv_row as &$row) {
$row = strtr($row, array("'" => "\'", '"' => '\"'));
}
$insert_str .= '("'
// Implode the array and fix pesky apostrophes.
.implode('","', $csv_row)
.'"),'."\n";
}
// Remove the trailing comma.
$insert_str = rtrim($insert_str, ",\n");
// Insert all of the values at once.
$this->db->set($insert_str);
echo '<script type="text/javascript">
alert("Document successfully uploaded and saved to the database.");
location = "program/index";
</script>';
}
else {
echo '<script type="text/javascript">
alert("Sorry! Something went wrong please proceed to try again.");
location = "program/upload";
</script>';
}
}
}
When I run var_dump($fh); it shows: resource(89) of type (stream)
When I run var_dump($fpath) it shows: string(66) "/Applications/MAMP/htdocs/site/assets/uploads/CSV/wc_program.csv"
So it all uploads but what is wrong with it not updating the database?
I have tried all kinds of changing the fopen method but still no joy, I really need it to add to the database and the insert query and set query should do the trick but it doesn't.
Any help greatly appreciated!

You are not running any query on the database. You are mixing active record syntax with simple query syntax. The active record insert query will be executed by calling.
$this->db->insert('my_table');
db::set() does not actually query the database. It takes in a key/value pair that will be inserted or updated after db::insert() or db::update() is called. If you build the query yourself you need to use the db::query() function.
Review the active directory documentation.
You can use $this->db->query('put your query here'), but you lose the benefit of CodeIgniter's built in security. Review CodeIgniter's query functions.
I'll give you examples of just a few of the many ways you can insert into a database using CodeIgniter. The examples will generate the query from your comment. You will need to adjust your code accordingly.
EXAMPLE 1:
$result = $this->db
->set('JobRef', 911847)
->set('Area', 'Coastal')
->set('Parish', 'Yapton')
->set('AbbrWorkType', 'Micro')
->set('WorkType', 'Micro-Asphalt Surfacing')
->set('Timing', 'TBC')
->set('TrafficManagement', 'No Positive Traffic Management')
->set('Location', 'Canal Road (added PMI 16/07/12)')
->set('Duration', '2 days')
->set('Start', '0000-00-00')
->set('Finish', '0000-00-00')
->insert('wc_program');
echo $this->db->last_query() . "\n\n";
echo "RESULT: \n\n";
print_r($result);
EXAMPLE 2 (Using an associative array):
$row = array(
'JobRef' => 911847,
'Area' => 'Coastal',
'Parish' => 'Yapton',
'AbbrWorkType' => 'Micro',
'WorkType' => 'Micro-Asphalt Surfacing',
'Timing' => 'TBC',
'TrafficManagement' => 'No Positive Traffic Management',
'Location' => 'Canal Road (added PMI 16/07/12)',
'Duration' => '2 days',
'Start' => '0000-00-00',
'Finish' => '0000-00-00'
);
$this->db->insert('wc_program', $row);
// This will do the same thing
// $this->db->set($row);
// $this->db->insert('wc_program');
echo $this->db->last_query();
Example 1 and 2 are using the Active Record. The information is stored piece by piece and then the query is built when you make the final call. This has several advantages. It allows you to build queries dynamically without worrying about SQL syntax and order of the keywords. It also escapes your data.
EXAMPLE 3 (Simple Query):
$query = 'INSERT INTO
wc_program
(JobRef, Area, Parish, AbbrWorkType, WorkType, Timing, TrafficManagement, Location, Duration, Start, Finish)
VALUES
("911847","Coastal","Yapton","Micro","Micro-Asphalt Surfacing","TBC","No Positive Traffic Management","Canal Road (added PMI 16/07/12)","2 days","0000-00-00","0000-00-00")';
$result = $this->db->query($query);
echo $this->db->last_query() . "\n\n";
echo "RESULT: \n";
print_r($result);
This way leaves all the protection against injection up to you, can lead to more errors, and is harder to change/maintain.
If you are going to do it this way you should use the following syntax, which will protect against injection.
EXAMPLE 4:
$query = 'INSERT INTO
wc_program
(JobRef, Area, Parish, AbbrWorkType, WorkType, Timing, TrafficManagement, Location, Duration, Start, Finish)
VALUES
(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);';
$row = array(
911847,
'Coastal',
'Yapton',
'Micro',
'Micro-Asphalt Surfacing',
'TBC',
'No Positive Traffic Management',
'Canal Road (added PMI 16/07/12)',
'2 days',
'0000-00-00',
'0000-00-00'
);
$result = $this->db->query($query, $row);
echo $this->db->last_query() . "\n\n";
echo "RESULT: \n";
print_r($result);
CodeIgniter will replace each "?" in the query with the corresponding value from the array after it is escaped. You can use this to run many queries that are of the same form, but have different data just by updating the $row array and benefit from CI's built in security.

Perl Out Of Memory

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;
my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";
open(OPT, "<$OPT");
while(<OPT>){
my #OPT_Line = split(/,/, $_);
$beg_docs{#OPT_Line[0]} = "Y" if(#OPT_Line[3] eq "Y");
$opt_beg_docs{#OPT_Line[0]} = "Y";
}
close(OPT);
open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
print OUT "$key\n";
}
close(OUT);
open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
$_ =~ s/\xFE//g;
my #DAT_Line = split(/\x14/, $_);
#gets the prefix and the range of the beg and end docs
(my $pre = #DAT_Line[$beg_doc]) =~ s/[0-9]//g;
(my $beg = #DAT_Line[$beg_doc]) =~ s/\D//g;
(my $end = #DAT_Line[$end_doc]) =~ s/\D//g;
#print OUT "BEGDOC: $beg ENDDOC: $end\n";
foreach($beg .. $end){
my $doc_id = $pre . $_;
if($opt_beg_docs{$doc_id} ne "Y"){
if($beg_docs{$doc_id} ne "Y"){
print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
$doc_counter++;
} else {
print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
$page_counter++;
}
}
}
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically I get all the ID's from the file I am checking against to see if the ID exists in. Then I loop over the and generate the ID's for the other file, because they are presented as a range. Then I take the generated ID and check for it in the hash of ID's.
Also forgot to note I am using Windows

You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.

I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = #DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.

You're closing your OUT file handle and then trying to print to it inside the DAT loop, which, I think might be outputting to random memory, since you closed the FILEHANDLE - surprised this didn't output an error.
Remove the first close(OUT); and see if that improves.
I still don't know what your question is, if it's about the error message it means you've run out of memory. If it's about the message itself - you're trying to consume too much memory. If it's why you're consuming too much memory, I'd first ask if you read my message above, then I'd ask how much memory your system has, then I'd follow up with seeing if it improves if you take the regex away.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Scraping multiple items off of a Page into a Neat Row - ajax

Related

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

Downloading multiple fasta files from ncbi

Perl : ignore columns from csv file during insert

Codeigniter CSV upload then explode

Perl Out Of Memory

Categories

Resources