So I have this script that scrape data to a website, its getting and downloading a CSV and its process the CSV row by row and converts it into TSV, once that finished the TSV file will be converted into a HTML file. I'm done the rest of that but the output that I'm getting is some what messed up, the script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is then turned into a TSV file that we then turn into HTML. The CSV file seems to be sorted by the first column for each row that is returned but not based on any of the other columns in the same row. Therefore what is happening is that entries with the same first column values can be jumbled up from one download to the next download of the same file.
A visual representation of only sorting by the first column this follows with numbers representing column data:
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
So what I have in mind is the process will be like this, download the CSV file from the source and then perform a sort on the lines in that CSV file to normalize them for comparison to one another before writing the TSV or HTML files. This should allow for accurate comparison for updated data files. but I didn't know how to do this my logic is like this
So I will put the function between the 1. and 2. before it process the CSV file into TSV File I want the content of the CSV is already sorted.
So my script is looking like this
my $download_dir_link ="C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");
#CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while($complete_download_flag == 0)
{
my #download_directory = read_dir($download_dir_link);
foreach my $downloaded_file (#download_directory)
{
if($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
{
$complete_download_flag = 1;
}
}
sleep(5);
}
#SORTED CONTENTS OF CSV BEFORE CONVERSION
print "sORTING csv content...\n";
#CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new ({ binary => 1 });
my $tsv = Text::CSV->new ({ binary => 1, sep_char => "\t", eol => "\n"});
open my $infh, "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline ($infh))
{
$tsv->print ($outfh, $row);
}
close($infh);
close($outfh);
my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
$tsv_content = $tsv_content.$_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";
#CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if(exists $existing_index_hash{$doc_uid."_pind.html"})
{
if($existing_index_hash{$doc_uid."_pind.html"} ne $tsv_content)
{
$changed_flag = "1";
$updated_files = $updated_files."-$classification\n";
print "Updated: $classification\n";
Hope someone here can help me on this thank you
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = 'Text::CSV_XS'->new({binary => 1, auto_diag => 1});
open my $in, '<', shift or die $!;
my #rows;
while (my $row = $csv->getline($in)) {
push #rows, $row;
}
# Here the sorting happens. Compare the first column,
# if the values are the same, compare the second column.
#rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } #rows;
$csv->say(*STDOUT, $_) for #rows;
You can use the following to sort by all columns (but it compares the values as strings, it doesn't work for numbers):
sub by_all {
my ($n, $A, $B) = #_;
$A->[$n] cmp $B->[$n]
|| $n < $#$A && by_all($n + 1, $A, $B)
}
sort { by_all(0, $a, $b) } #rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw{ looks_like_number };
sub by_all {
my ($n, $A, $B) = #_;
(looks_like_number($A->[$n])
? $A->[$n] <=> $B->[$n]
: $A->[$n] cmp $B->[$n]
) || $n < $#$A && by_all($n + 1, $A, $B)
}
I am trying to get read value from a file in shell script based on input. How I can do it ?
I tried in different way but not able to read value in below file.
# dates
dates:
start: '10-09-2018'
end: '10-02-2019'
# filters
filters:
- table: 'employee'
column: 'sex'
operation: '=='
value: M
- table: 'employee'
column: 'department'
operation: 'notin'
value: ['operation', 'sales']
- table: 'organisation'
column: 'org_id'
operation: '=='
value: 124
- table: 'organisation'
column: 'org_name'
operation: '=='
value: XYZ LIMITED
My expected output,
If I pass 'filters' and table name 'employee' as input, I should get output as sex='M' and department <> 'operation' and department <> 'sales'
Can anyone help me on this.
So, assuming you fix the indentation in that example so it's valid YAML, the following perl script produces what you want:
#!/usr/bin/perl
use warnings;
use strict;
use YAML::XS qw/LoadFile/;
my ($file, $mapping, $table) = #ARGV;
my $yaml = LoadFile $file;
die "No such mapping: $mapping\n" unless exists $yaml->{$mapping};
my %ops = ('==' => '=', 'notin' => '<>');
my $first = 1;
for my $elem (#{$yaml->{$mapping}}) {
if ($elem->{'table'} eq $table) {
my $col = $elem->{'column'};
my $value = $elem->{'value'};
my $op = $elem->{'operation'};
$op = $ops{$op} // $op;
print ' and ' unless $first;
$first = 0;
if (ref $value eq 'ARRAY') {
print join(" and ", map { "$col $op '$_'" } #$value);
} else {
print "$col $op '$value'";
}
}
}
print "\n";
Example:
$ perl boringname.pl example.yaml filters employee
sex = 'M' and department <> 'operation' and department <> 'sales'
Requires the YAML::XS module, installable through your favorite CPAN client or your OS package manager (Ubuntu calls the package libyaml-libyaml-perl. Not sure about others.).
ajax driven search(date range is input).
search is working properly however when I am trying to convert status(value 1 & 2 is stored in database) as ON or OFF. I'm getting this error, when I change if ('.$row->status.' === '1' ) to if ('.$row->status.' === 1 ) output is
if (1 === 1 ) else if (2 === 1 ) else if (1 === 1 ) else
foreach ($query as $row)
{
$output .= '
<tr>
if ('.$row->status.' === '1' ) //err
<td>ON</td>
else
<td>OFF</td>
</tr>
You have to go through something like the following thought process...
"I want to output a string "On" or "Off" based on the result of $row->status.
So first you need to determine what the string , let's call it $on_off.
Then you need to create your string segment based upon $on_off.
foreach ($query as $row)
{
// Determine the ON/OFF string to output
$on_off = ($row->status)?'ON':'OFF';
// Create the String Segment.
$output .= '<tr><td>';
$output .= $on_off;
$output .= '</td></tr>';
// Whatever else is here...
}
The reason I have broken $output into 3 commands is to help break up the table tags into something a bit more readable and less prone to mistakes.
As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I do some code that I will not post here for brevity and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page. In the code below, I only do 1.
I'd also like to make each item be separated by a , in the output .txt.
And, I'd like the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = new HTML::DOM;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!
This is fairly simply done using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the header of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
Am I to expect an email from you? I replied to your first mail but haven't heard since.
use strict;
use warnings;
use URI;
use LWP;
use HTML::TreeBuilder::XPath;
my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my #results = $tree->findnodes('//table[#class="result-data"]//tr[#class="result-item"]');
my %data;
for my $item (#results) {
my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
$key =~ s/:$//;
$data{$key} = $val;
}
my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
'event place', 'marital status', 'residence in 1935',
'relationship to head of household (standardized)';
print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new( );
$te->parse($content);
$table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n"; exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS
I thought I had answered your question.
The problem is that you are trying to fetch the webpage with LWP. Why are try to doing that if you already have WWW::Mechanize::Firefox?
Did you tried this?
It will retrieve and save each link for further analyses. A small change and you 'get' the DOM tree. Sorry, I do not have acccess to this page, so I just hope it will work.
my $i=1;
for my $link (#links) {
print Dumper $link->url;
print Dumper $link->text;
my $tempfile = './$i.html';$i++;
$mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
my $dom_tree = $mech->document();
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with regexp (Everyone: Please remember, there is always more than one way to do something wwith Perl!. It works, it is easy...)
it tried it out with this cmd:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html|cat 1.html|1.pl
use Data::Dumper;
use strict;
use warnings;
local $/=undef;
my $html = <>;#read from file
#$html = $mech->content( format => 'html' );# read data from mech object
my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is){ # process each TD
my $td = $1;
print "td: $td\n";
my $td_val = $td;
$td_val =~ s!<[^>]*>!!gis;
$td_val =~ s!\s+! !gs;
$td_val =~ s!(\A\s+|\s+\z)!!gs;
if ($td =~ m!result-label!){ #primitive state machine, store the current label
print "current_label: $current_label\n";
$current_label = $td_val;
} elsif ($td =~ m!result-value!){ #add each data to current label
push(#{$data->{$current_label}},$td_val);
} else {
warn "found something else: $td\n";
}
}
#process it using a white lists of known entries (son,race, etc).Delete from the result if you find it on white list, die if you find something new.
#multi type
foreach my $type (qw(son wife daughter head)){
process_multi($type,$data->{$type});
delete($data->{$type});
}
#simple type
foreach my $type (qw(birthplace age)){
process_simple($type,$data->{$type});
delete($data->{$type});
}
die "Unknown label!".Dumper($data) if scalar(keys %{$data})>0;
Output:
'line number:' => [
'28'
],
'estimated birth year:' => [
'1912'
],
'head' => [
'Benjamin Schuvlein',
'M',
'28',
'Germany'
],
I have a problem with one field of the DB. With this code:
$expireMonth = Mage::getStoreConfig('points_options/config_points/expiration_period', Mage::app()->getStore()->getId());
if (!is_null($expireMonth) && ($expireMonth > 0)) {
$expireDate = date("Y-m-d H:i:s", strtotime("+" . $expireMonth . " month"));
} else {
$expireDate = NULL;
}
//die($expireDate);
//store in points history table
$this->_pointsModel->setCustomerId($this->_customer->getId())
->setOrdersId('welcome')
->setPointsPending($pointsForNewCustomer)
->setPointsComment(Mage::helper('points')->__('welcome points'))
->setDateAdded(date('Y-m-d H:i:s'))
->setPointsStatus(2)//confirmed
->setPointsType('WE')
->setStoreId(Mage::app()->getStore()->getId())
->setExpireDate($expireDate)
->save();
Every field is saved in the table, except for expire_date. If I uncomment the die($expireData), I see the correct value, something like 2012-01-13 13:21:12. The field is defined as:
`expire_date` datetime NULL
Any thoughts?
edit: The solution is:
$expireDate = date("Y-m-d H:i:s", strtotime("+" . $expireMonth . " months"));
Check out the "s" in my strtotime expression.
I know many attributes allow for some sort of formatting before writing and after reading. Have you tried setting the value as a unix timestamp instead?