How do I handle Unicode with DBD::Oracle?

The Perl DBI documentation says this:
Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
DBD::SQLite with parameter (sqlite_unicode => 1), or DBD::Pg with parameter (pg_enable_utf8 => -1) -- which is the default -- do indeed perform such conversions.
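For comparison, here is roughly how those two drivers are enabled. This is only a sketch: the DSNs, credentials and database names are placeholders, and pg_enable_utf8 => -1 is spelled out even though recent DBD::Pg versions already default to it.
use DBI;

# DBD::SQLite: decode fetched text and encode bound text as UTF-8
my $dbh_sqlite = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
                              { sqlite_unicode => 1, RaiseError => 1 });

# DBD::Pg: -1 means "automatic", the documented default
my $dbh_pg = DBI->connect('dbi:Pg:dbname=test', 'user', 'password',
                          { pg_enable_utf8 => -1, RaiseError => 1 });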
With DBD::Oracle (v1.83, NLS_LANG='FRENCH_FRANCE.UTF8') this is not so: if non-Unicode strings are passed to INSERT or UPDATE statements, the driver does not automatically upgrade them to utf8.
Here is my test suite. The variants for SQLite and Pg succeed, but this Oracle variant fails:
use utf8;
use strict;
use warnings;
use Test::More;
use SQL::Abstract::More;
use Scalar::Util qw/looks_like_number/;
use DBI;
my @DBI_CONNECT_ARGS = @ARGV;
my ($table, $key_col, $val_col) = qw/TST_UTF8 KEY VAL/; # assuming this table is already created
binmode $_, ':utf8' for *STDERR, *STDOUT;
# strings for tests
my %str;
$str{utf8} = "il était une bergère"; # has flag utf8 because of 'use utf8'
$str{native} = $str{utf8}; utf8::downgrade($str{native}); # without flag utf8
$str{wide_chars} = "il était une bergère♥♡"; # chars > 255 - cannot be a native string (\x{2665}\x{2661})
$str{named_chars} = "il \N{LATIN SMALL LETTER E WITH ACUTE}tait une " # identical to string 'wide_chars'
. "berg\N{LATIN SMALL LETTER E WITH GRAVE}re"
. "\N{BLACK HEART SUIT}\N{WHITE HEART SUIT}";
# check that test strings meet expectations
ok utf8::is_utf8($str{utf8}), "perl string with utf8 flag";
ok !utf8::is_utf8($str{native}), "perl string without utf8 flag, (native chars ... latin1)";
is $str{utf8}, $str{native}, "strings 'utf8' and 'native' have different encodings but represent the same chars";
ok utf8::is_utf8($str{wide_chars}), "string with wide chars must have utf8 flag";
ok utf8::is_utf8($str{named_chars}), "string with named wide chars must have utf8 flag";
is $str{wide_chars}, $str{named_chars}, "named chars are identical to chars from perl source";
my $dbh = DBI->connect(@DBI_CONNECT_ARGS);
my $sqlam = SQL::Abstract::More->new;
my ($sql, @bind);
# suppress records from previous run
my @k = keys %str;
($sql, @bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \@k}});
my $del = $dbh->do($sql, {}, @bind);
note "DELETED $del records";
# insert strings via bind values
while (my ($key, $val) = each %str) {
    ($sql, @bind) = $sqlam->insert(-into => $table, -values => {$key_col => $key, $val_col => $val});
    my $ins = $dbh->do($sql, {}, @bind);
    note "INSERT via bind $key: $ins";
}
# read data back
($sql, @bind) = $sqlam->select(-from    => $table,
                               -columns => [$key_col, $val_col],
                               -where   => {$key_col => {-in => \@k}});
my $rows = $dbh->selectall_arrayref($sql, {}, @bind);
my %str_from_db = map {@$_} @$rows;
# check round trip
is_deeply \%str_from_db, \%str, 'round trip with bind values';
# suppress again
($sql, @bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \@k}});
$del = $dbh->do($sql, {}, @bind);
note "DELETED $del records";
# insert strings via raw sql
while (my ($key, $val) = each %str) {
    my $ins = $dbh->do("INSERT INTO $table($key_col, $val_col) VALUES ('$key', '$val')");
    note "INSERT via raw SQL $key: $ins";
}
# read data back again and check round trip
($sql, @bind) = $sqlam->select(-from    => $table,
                               -columns => [$key_col, $val_col],
                               -where   => {$key_col => {-in => \@k}});
$rows = $dbh->selectall_arrayref($sql, {}, @bind);
%str_from_db = map {@$_} @$rows;
is_deeply \%str_from_db, \%str, 'round trip with raw SQL';
done_testing;
As a workaround, I added some callbacks for automatic upgrading of native strings; with this addition the tests pass:
$dbh->{Callbacks}{prepare} = sub {
    # warn "PREPARE : upgrading stmt: $_[1]\n";
    utf8::upgrade($_[1]);
    return;
};
$dbh->{Callbacks}{ChildCallbacks}{execute} = sub {
    # warn "EXECUTE: ";
    foreach my $i (1 .. $#_) {
        if ($_[$i] && !ref $_[$i] && !looks_like_number($_[$i])) {
            # warn "upgrading $i : $_[$i];";
            utf8::upgrade($_[$i]);
        }
    }
    # print STDERR "\n"; # leftover debug output
    return;
};
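A side note on why this workaround is lossless: utf8::upgrade only changes Perl's internal representation of a string, never the characters it contains, so applying it to statements and bind values cannot alter the data. A minimal illustration of that invariant:
use utf8;
use strict;
use warnings;

my $native = "bergère";
utf8::downgrade($native);     # stored as latin-1 bytes, utf8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);     # stored as utf8 bytes, utf8 flag on

print "same chars\n"   if $upgraded eq $native;   # true: eq compares characters
print "flags differ\n" if utf8::is_utf8($upgraded) && !utf8::is_utf8($native);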
If I understand the DBI spec properly, this automatic upgrade should be performed by the DBD::Oracle driver, not by the application code. Or am I missing something?

Related

How to read a value from a file in shell scripting?

I am trying to read a value from a file in a shell script, based on input. How can I do it?
I tried different ways but was not able to read the values in the file below.
# dates
dates:
  start: '10-09-2018'
  end: '10-02-2019'
# filters
filters:
  - table: 'employee'
    column: 'sex'
    operation: '=='
    value: M
  - table: 'employee'
    column: 'department'
    operation: 'notin'
    value: ['operation', 'sales']
  - table: 'organisation'
    column: 'org_id'
    operation: '=='
    value: 124
  - table: 'organisation'
    column: 'org_name'
    operation: '=='
    value: XYZ LIMITED
My expected output: if I pass 'filters' and the table name 'employee' as input, I should get output like sex = 'M' and department <> 'operation' and department <> 'sales'.
Can anyone help me with this?
So, assuming the indentation is fixed so that the example is valid YAML (roughly as rendered above), the following Perl script produces what you want:
#!/usr/bin/perl
use warnings;
use strict;
use YAML::XS qw/LoadFile/;

my ($file, $mapping, $table) = @ARGV;
my $yaml = LoadFile $file;
die "No such mapping: $mapping\n" unless exists $yaml->{$mapping};

my %ops = ('==' => '=', 'notin' => '<>');
my $first = 1;
for my $elem (@{$yaml->{$mapping}}) {
    if ($elem->{'table'} eq $table) {
        my $col   = $elem->{'column'};
        my $value = $elem->{'value'};
        my $op    = $elem->{'operation'};
        $op = $ops{$op} // $op;
        print ' and ' unless $first;
        $first = 0;
        if (ref $value eq 'ARRAY') {
            print join(" and ", map { "$col $op '$_'" } @$value);
        } else {
            print "$col $op '$value'";
        }
    }
}
print "\n";
Example:
$ perl boringname.pl example.yaml filters employee
sex = 'M' and department <> 'operation' and department <> 'sales'
Requires the YAML::XS module, installable through your favorite CPAN client or your OS package manager (Ubuntu calls the package libyaml-libyaml-perl; not sure about others).
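If you want to inspect exactly what data structure LoadFile hands back before adapting the script to your real file, a quick one-liner helps (example.yaml is the placeholder file name used above):
perl -MYAML::XS=LoadFile -MData::Dumper -e 'print Dumper(LoadFile("example.yaml")->{filters})'
It should print an array of hashes, one per filter entry, which is what the loop above iterates over.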

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation daily. Reading one file and populating the hash takes about 10 minutes, so about 20 minutes are spent reading both files and populating their hashes.
The comparison process takes about 30-50 minutes, including iterating over the hashes, collecting the desired results, and writing them to an output file (CSV or PSV).
All in all, it can take anywhere between 30 and 60 minutes on a 32-core dual-Xeon CPU server with 256 GB of RAM, depending on intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single-threaded approach using perl5:
1. Fetch data from the two sources (say s1 and s2) one by one and populate a hash based on key-value pairs. The source of data can be either a flat CSV or PSV file, or a database query array-of-arrays result via a DBI client. The data is always unsorted to start with. To be specific, I read the file line by line, split the fields, and choose the desired indexes for the key/value pair to insert into the hash.
2. After collecting the data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1 and vice versa).
3. Dump the output to an Excel file (very costly if the number of lines is large, like ~1 million or greater) or to a simple CSV (a cheap operation, and the preferred method).
I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once to populate my global hash, and then proceed to compare and dump the output.
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations in perl6, but I am not certain which can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried so far; the reason is that I am just beginning to tackle this one. I am just unable to see past the single-threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed 'too broad' by the community, allow me to attempt to highlight my pain points below:
1. I would like to do the file comparison by utilizing all 32 cores, if possible. I am just not able to come up with a strategy or an initial idea.
2. What new techniques are available or applicable with perl6 in order to tackle this problem, or this type of problem?
3. If I spawn 2 processes to read the file(s) and collect data, is it possible to get the results back as an array or hash?
4. Is it possible to compare the data (stored in hashes) in parallel?
My current p5 comparison logic is shown below for your reference. I hope this helps and keeps the question from being shut down.
package COMP;

use strict;
use Data::Dumper;

sub comp
{
    my ($data, $src, $tgt) = @_;

    my $result = {};
    my $ms     = ($result->{ms}   = {});
    my $mt     = ($result->{mt}   = {});
    my $diff   = ($result->{diff} = {});

    foreach my $key (keys %{$data->{$src}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$tgt}{$key}) {
            push(@{$mt->{$key}}, "$src_val|NULL");
        }
        if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
            push(@{$diff->{$key}}, "$src_val|$tgt_val");
        }
    }

    foreach my $key (keys %{$data->{$tgt}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$src}{$key}) {
            push(@{$ms->{$key}}, "NULL|$tgt_val");
        }
    }

    return $result;
}

1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User@Host:]$ perl testCOMP.pl
$VAR1 = {
          'mt' => {
                    'Source' => [
                                  'source|NULL'
                                ]
                  },
          'ms' => {
                    'Target' => [
                                  'NULL|target'
                                ]
                  },
          'diff' => {
                      'Sunday_isit' => [
                                         'Yes|No'
                                       ]
                    }
        };
Test Script
[User@Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;
my $data2 = {
    f1 => {
        Amitabh     => 'Bacchan',
        YellowSun   => 'Yes',
        Sunday_isit => 'Yes',
        Source      => 'source',
    },
    f2 => {
        Amitabh     => 'Bacchan',
        YellowSun   => 'Yes',
        Sunday_isit => 'No',
        Target      => 'target',
    },
};
my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User@Host:]$
If you have an existing and working toolchain, you don't have to rewrite it all to use Perl6. Its parallelism mechanisms work fine with external processes too. Consider:
allnum.pl6
use v6;

my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
    .map( -> $filename {
        [ $filename, run "perl", "num.pl", $filename, :out ];
    })
    .hyper;

say "Lazyness Here!";
my $time = time;

for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;

my $file = shift @ARGV;
my $start = time;
my $result = 0;

open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;
    $result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num5.txt
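On the asker's question 3 (spawning two reader processes and getting a hash back), the same pattern is also workable in Perl 5 itself. A minimal sketch, assuming the CPAN module Parallel::ForkManager is installed; the file names and key/value column choices are placeholders:
use strict;
use warnings;
use Parallel::ForkManager;

my %data;                                 # source name => hashref of key/value pairs
my $pm = Parallel::ForkManager->new(2);   # one worker per source

# collect each child's hash when it exits
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $payload) = @_;
    $data{$ident} = $payload if defined $payload;
});

for my $src ('s1.csv', 's2.csv') {        # placeholder file names
    $pm->start($src) and next;            # parent keeps looping; child falls through
    my %h;
    open my $fh, '<', $src or die "$src: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /,/, $line;
        $h{$f[0]} = $f[1];                # placeholder key/value column choice
    }
    close $fh;
    $pm->finish(0, \%h);                  # hash is serialized back to the parent
}
$pm->wait_all_children;

# %data now holds one hash per source and could feed a COMP::comp-style pass
Each child ships its hash back through finish, so after wait_all_children the parent holds one hash per source without any shared-memory machinery.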

Parsing a string field

I have these Syslog messages:
N 4000000 PROD 15307 23:58:12.13 JOB78035 00000000 $HASP395 GGIVJS27 ENDED\r
NI0000000 PROD 15307 23:58:13.41 STC81508 00000200 $A J78036 /* CA-JOBTRAC JOB RELEASE */\r
I would like to parse these messages into various fields in a Hash, e.g.:
event['recordtype'] #=> "N"
event['routingcode'] #=> "4000000"
event['systemname'] #=> "PROD"
event['datetime'] #=> "15307 23:58:12.13"
event['jobid'] #=> "JOB78035"
event['flag'] #=> "00000000"
event['messageid'] #=> "$HASP395"
event['logmessage'] #=> "$HASP395 GGIVJS27 ENDED\r"
This is the code I have currently:
message = event["message"];
if message.to_s != "" then
  if message[2] == " " then
    array = message.split(%Q[ ]);
    event[%q[recordtype]]  = array[0];
    event[%q[routingcode]] = array[1];
    event[%q[systemname]]  = array[2];
    event[%q[datetime]]    = array[3] + " " + array[4];
    event[%q[jobid]]       = message[38,8];
    event[%q[flags]]       = message[47,8];
    event[%q[messageid]]   = message[57,8];
    event[%q[logmessage]]  = message[56..-1];
  else
    array = message.split(%Q[ ]);
    event[%q[recordtype]]  = array[0][0,2];
    event[%q[routingcode]] = array[0][2..-1];
    event[%q[systemname]]  = array[1];
    event[%q[datetime]]    = array[2] + " " + array[3];
    event[%q[jobid]]       = message[38,8];
    event[%q[flags]]       = message[47,8];
    event[%q[messageid]]   = message[57,8];
    event[%q[logmessage]]  = message[56..-1];
  end
end
I'm looking to improve the above code. I think I could use a regular expression, but I don't know how to approach it.
You can't use split(' ') or a default split to process your fields, because you are dealing with columnar data whose fields sometimes have no whitespace between them, which throws your array off. Instead, you have to pick apart each record by columns.
There are many ways to do that, but the simplest, and probably the fastest, is indexing into a string and grabbing n characters:
'foo'[0, 1] # => "f"
'foo'[1, 2] # => "oo"
The first means "starting at index 0 in the string, grab one character." The second means "starting at index 1 in the string, grab two characters."
Alternately, you could tell Ruby to extract by ranges:
'foo'[0 .. 0] # => "f"
'foo'[1 .. 2] # => "oo"
These are documented in the String class.
This makes writing code that's easily understood:
record_type = message[ 0 .. 1 ].rstrip
routing_code = message[ 2 .. 8 ]
system_name = message[ 10 .. 17 ]
Once you have your fields captured, add them to a hash:
{
  'recordtype'  => record_type,
  'routingcode' => routing_code,
  'systemname'  => system_name,
  'datetime'    => date_time,
  'jobid'       => job_id,
  'flags'       => flags,
  'messageid'   => message_id,
  'logmessage'  => log_message,
}
While you could use a regular expression, there's not much to be gained by one; it's just another way of doing it. Regular expressions are more useful when picking data out of free-form text, but with columnar data they tend to produce visual noise that makes maintenance more difficult. I'd recommend simply determining your columns, then cutting the data you need from each line based on those.

Parsing string into ARGV equivalent (Windows and Perl)

Edit - Answer posted below
I have a script that usually uses @ARGV arguments, but in some cases it is invoked by another script (which I cannot modify) that instead passes only a config filename, which among other things contains the command-line options that should have been passed directly.
Example:
Args=--test --pdf "C:\testing\my pdf files\test.pdf"
If possible I'd like a way to parse this string into an array that would be identical to @ARGV.
I have a workaround where I set up an external perl script that just echoes @ARGV, and I invoke this script like below (standard boilerplate removed).
echo-args.pl
print join ("\n", @ARGV);
test-echo-args.pl
my $my_args = '--test --pdf "C:\testing\my pdf files\test.pdf"';
my @args = map { chomp; $_ } `perl echo-args.pl $my_args`;
This seems inelegant, but it works. Is there a better way that doesn't invoke a new process? I did try splitting and processing, but there are some oddities in the command-line rules, e.g. -a"b c" becomes '-ab c' and -a"b"" becomes -ab", and I'd rather not worry about edge cases, but I know that'll bite me one day if I don't.
Answer - thanks ikegami!
I've posted a working program below that uses Win32::API and CommandLineToArgvW from shell32.dll, based on ikegami's advice. It is intentionally verbose in the hope that it'll be easier to follow for anyone who, like myself, is extremely rusty with C and pointer arithmetic.
Any tips are welcome, apart from the obvious simplifications :)
use strict;
use warnings;

use Encode qw( encode decode );
use Win32::API qw( );
use Data::Dumper;

# create a test argument string, with some variations, and pack it
# apparently an empty string returns $^X which is documented so check before calling
my $arg_string = '--test 33 -3-t" "es 33\t2 ';
my $packed_arg_string = encode('UTF-16le', $arg_string."\0");

# create a packed integer buffer for output
my $packed_argc_buf_ptr = pack('L', 0);

# create then call the function and get the result
my $func = Win32::API->new('shell32.dll', 'CommandLineToArgvW', 'PP', 'N')
    or die $^E;
my $ret = $func->Call($packed_arg_string, $packed_argc_buf_ptr);

# unpack to get the number of parsed arguments
my $argc = unpack('L', $packed_argc_buf_ptr);
print "We parsed $argc arguments\n";

# parse the return value to get the actual strings
my @argv = decode_LPWSTR_array($ret, $argc);
print Dumper \@argv;

# try not to leak memory
my $local_free = Win32::API->new('kernel32.dll', 'LocalFree', 'N', '')
    or die $^E;
$local_free->Call($ret);

exit;

sub decode_LPWSTR_array {
    my ($ptr, $num) = @_;
    return undef if !$ptr;
    # $ptr is the memory location of the array of strings (i.e. more pointers)
    # $num is how many we need to get
    my @strings = ();
    for (1 .. $num) {
        # convert $ptr to a long, using that location read 4 bytes - this is the pointer to the next string
        my $string_location = unpack('P4', pack('L', $ptr));
        # make it human readable
        my $readable_string_location = unpack('L', $string_location);
        # decode the string and save it for later
        push(@strings, decode_LPCWSTR($readable_string_location));
        # our pointers are 32-bit
        $ptr += 4;
    }
    return @strings;
}

# Copied from http://stackoverflow.com/questions/5529928/perl-win32api-and-pointers
sub decode_LPCWSTR {
    my ($ptr) = @_;
    return undef if !$ptr;
    my $sW = '';
    for (;;) {
        my $chW = unpack('P2', pack('L', $ptr));
        last if $chW eq "\0\0";
        $sW .= $chW;
        $ptr += 2;
    }
    return decode('UTF-16le', $sW);
}
On Unix systems, it's the shell that parses the command line into strings. On Windows, it's up to each application. This is normally done using the CommandLineToArgvW system call (which you could call with the help of Win32::API), but the parsing rules are documented by Microsoft if you want to reimplement them yourself.
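If spawning a process is the only objection and approximate splitting is acceptable, the core module Text::ParseWords can split a string shell-style in-process. This is a sketch, not a drop-in replacement: it applies POSIX-ish quoting rules, so it will not reproduce the Windows quirks (-a"b c" and friends) noted in the question.
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Caution: shellwords treats backslashes as escapes, so literal Windows
# paths would need doubling (C:\\testing\\...) or forward slashes.
my $args = '--test --pdf "C:/testing/my pdf files/test.pdf"';
my @argv = shellwords($args);
print "$_\n" for @argv;   # --test, --pdf, C:/testing/my pdf files/test.pdf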

Scraping multiple items off of a Page into a Neat Row

As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I do some code that I will not post here for brevity and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page; in the code below, I only scrape one.
I'd also like each item to be separated by a comma in the output .txt, and the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = HTML::DOM->new;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!
This is fairly simply done using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the header of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
Am I to expect an email from you? I replied to your first mail but haven't heard since.
use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);

my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');

my %data;
for my $item (@results) {
    my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
    $key =~ s/:$//;
    $data{$key} = $val;
}

my $record = join ',',
    map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
    'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
    'event place', 'marital status', 'residence in 1935',
    'relationship to head of household (standardized)';

print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new( );
$te->parse($content);
$table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n"; exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS
I thought I had answered your question.
The problem is that you are trying to fetch the webpage with LWP. Why are you trying to do that if you already have WWW::Mechanize::Firefox?
Did you try this?
It will retrieve and save each link for further analysis. A small change and you 'get' the DOM tree. Sorry, I do not have access to this page, so I just hope it will work.
my $i = 1;
for my $link (@links) {
    print Dumper $link->url;
    print Dumper $link->text;
    my $tempfile = "./$i.html";   # double quotes so $i interpolates
    $i++;
    $mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
    my $dom_tree = $mech->document();
    my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with a regexp. (Everyone: please remember, there is always more than one way to do something with Perl! It works, it is easy...)
I tried it out with this command:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html|cat 1.html|1.pl
use Data::Dumper;
use strict;
use warnings;

local $/ = undef;
my $html = <>;  # read from file
#$html = $mech->content( format => 'html' );  # or read data from the mech object

my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is) {  # process each TD
    my $td = $1;
    print "td: $td\n";
    my $td_val = $td;
    $td_val =~ s!<[^>]*>!!gis;
    $td_val =~ s!\s+! !gs;
    $td_val =~ s!(\A\s+|\s+\z)!!gs;
    if ($td =~ m!result-label!) {  # primitive state machine: store the current label
        print "current_label: $current_label\n";
        $current_label = $td_val;
    } elsif ($td =~ m!result-value!) {  # add each data item to the current label
        push(@{$data->{$current_label}}, $td_val);
    } else {
        warn "found something else: $td\n";
    }
}
# Process it using a white list of known entries (son, race, etc). Delete entries
# from the result as you find them on the white list; die if you find something new.
# (process_multi and process_simple are left as an exercise.)

# multi-valued types
foreach my $type (qw(son wife daughter head)) {
    process_multi($type, $data->{$type});
    delete($data->{$type});
}

# simple types
foreach my $type (qw(birthplace age)) {
    process_simple($type, $data->{$type});
    delete($data->{$type});
}

die "Unknown label!" . Dumper($data) if scalar(keys %{$data}) > 0;
Output:
'line number:' => [
                    '28'
                  ],
'estimated birth year:' => [
                             '1912'
                           ],
'head' => [
            'Benjamin Schuvlein',
            'M',
            '28',
            'Germany'
          ],
