I am trying to get read value from a file in shell script based on input. How I can do it ?
I tried in different way but not able to read value in below file.
# dates
dates:
start: '10-09-2018'
end: '10-02-2019'
# filters
filters:
- table: 'employee'
column: 'sex'
operation: '=='
value: M
- table: 'employee'
column: 'department'
operation: 'notin'
value: ['operation', 'sales']
- table: 'organisation'
column: 'org_id'
operation: '=='
value: 124
- table: 'organisation'
column: 'org_name'
operation: '=='
value: XYZ LIMITED
My expected output,
If I pass 'filters' and table name 'employee' as input, I should get output as sex='M' and department <> 'operation' and department <> 'sales'
Can anyone help me on this.
So, assuming you fix the indentation in that example so it's valid YAML, the following perl script produces what you want:
#!/usr/bin/perl
use warnings;
use strict;
use YAML::XS qw/LoadFile/;
my ($file, $mapping, $table) = #ARGV;
my $yaml = LoadFile $file;
die "No such mapping: $mapping\n" unless exists $yaml->{$mapping};
my %ops = ('==' => '=', 'notin' => '<>');
my $first = 1;
for my $elem (#{$yaml->{$mapping}}) {
if ($elem->{'table'} eq $table) {
my $col = $elem->{'column'};
my $value = $elem->{'value'};
my $op = $elem->{'operation'};
$op = $ops{$op} // $op;
print ' and ' unless $first;
$first = 0;
if (ref $value eq 'ARRAY') {
print join(" and ", map { "$col $op '$_'" } #$value);
} else {
print "$col $op '$value'";
}
}
}
print "\n";
Example:
$ perl boringname.pl example.yaml filters employee
sex = 'M' and department <> 'operation' and department <> 'sales'
Requires the YAML::XS module, installable through your favorite CPAN client or your OS package manager (Ubuntu calls the package libyaml-libyaml-perl. Not sure about others.).
Related
The perl DBI documentation says this :
Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
DBD::Sqlite with parameter (sqlite_unicode => 1), or DBD::Pg with parameter (pg_enable_utf8 => -1) -- which is the default -- indeed do such conversions.
With DBD::Oracle (v1.83, NLS_LANG='FRENCH_FRANCE.UTF8') it is not so : if non-Unicode strings are passed to INSERT or UPDATE statements, the driver does not upgrade them automatically to utf8.
Here is my test suite. Variants for SQLite and Pg succeed, but this Oracle variant fails :
use utf8;
use strict;
use warnings;
use Test::More;
use SQL::Abstract::More;
use Scalar::Util qw/looks_like_number/;
use DBI;
my #DBI_CONNECT_ARGS = #ARGV;
my ($table, $key_col, $val_col) = qw/TST_UTF8 KEY VAL/; # assuming this table is already created
binmode $_, ':utf8' for *STDERR, *STDOUT;
# strings for tests
my %str;
$str{utf8} = "il était une bergère"; # has flag utf8 because of 'use utf8'
$str{native} = $str{utf8}; utf8::downgrade($str{native}); # without flag utf8
$str{wide_chars} = "il était une bergère♥♡"; # chars > 256 - cannot be a native string (\x{2665}\x{2661})
$str{named_chars} = "il \N{LATIN SMALL LETTER E WITH ACUTE}tait une " # identical to string 'wide_chars'
. "berg\N{LATIN SMALL LETTER E WITH GRAVE}re"
. "\N{BLACK HEART SUIT}\N{WHITE HEART SUIT}";
# check that test strings meet expectations
ok utf8::is_utf8($str{utf8}), "perl string with utf8 flag";
ok !utf8::is_utf8($str{native}), "perl string without utf8 flag, (native chars ... latin1)";
is $str{utf8}, $str{native}, "strings 'utf8' and 'native' have different encodings but represent the same chars";
ok utf8::is_utf8($str{wide_chars}), "string with wide chars must have utf8 flag";
ok utf8::is_utf8($str{named_chars}), "string with named wide chars must have utf8 flag";
is $str{wide_chars}, $str{named_chars}, "named chars are identical to chars from perl source";
my $dbh = DBI->connect(#DBI_CONNECT_ARGS);
my $sqlam = SQL::Abstract::More->new;
my ($sql, #bind);
# suppress records from previous run
my #k = keys %str;
($sql, #bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \#k}});
my $del = $dbh->do($sql, {}, #bind);
note "DELETED $del records";
# insert strings via bind values
while (my ($key, $val) = each %str) {
($sql, #bind) = $sqlam->insert(-into => $table, -values => {$key_col => $key, $val_col => $val});
my $ins = $dbh->do($sql, {}, #bind);
note "INSERT via bind $key: $ins";
}
# read data back
($sql, #bind) = $sqlam->select(-from => $table,
-columns => [$key_col, $val_col],
-where => {$key_col => {-in => \#k}});
my $rows = $dbh->selectall_arrayref($sql, {}, #bind);
my %str_from_db = map {#$_} #$rows;
# check round trip
is_deeply \%str_from_db, \%str, 'round trip with bind values';
# suppress again
($sql, #bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \#k}});
$del = $dbh->do($sql, {}, #bind);
note "DELETED $del records";
# insert strings via raw sql
while (my ($key, $val) = each %str) {
my $ins = $dbh->do("INSERT INTO $table($key_col, $val_col) VALUES ('$key', '$val')");
note "INSERT via raw SQL $key: $ins";
}
# check round trip
is_deeply \%str_from_db, \%str, 'round trip with raw SQL';
As a workaround, I added some callbacks for automatic upgrading of native strings; with this addition the tests pass :
$dbh->{Callbacks}{prepare} = sub {
# warn "PREPARE : upgrading stmt: $_[1]\n";
utf8::upgrade($_[1]);
return;
};
$dbh->{Callbacks}{ChildCallbacks}{execute} = sub {
# warn "EXECUTE: ";
foreach my $i (1 .. $#_) {
if ($_[$i] && ! ref $_[$i] && ! looks_like_number(($_[$i]))) {
# warn "upgrading $i : $_[$i];";
utf8::upgrade($_[$i]);
}
}
print STDERR "\n";
return;
};
If I understand properly the DBI spec, this automatic upgrade should be performed by the DBD::Oracle driver, not by the application code. Or am i missing something ?
So I have this script that scrape data to a website, its getting and downloading a CSV and its process the CSV row by row and converts it into TSV, once that finished the TSV file will be converted into a HTML file. I'm done the rest of that but the output that I'm getting is some what messed up, the script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is then turned into a TSV file that we then turn into HTML. The CSV file seems to be sorted by the first column for each row that is returned but not based on any of the other columns in the same row. Therefore what is happening is that entries with the same first column values can be jumbled up from one download to the next download of the same file.
A visual representation of only sorting by the first column this follows with numbers representing column data:
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
So what I have in mind is the process will be like this, download the CSV file from the source and then perform a sort on the lines in that CSV file to normalize them for comparison to one another before writing the TSV or HTML files. This should allow for accurate comparison for updated data files. but I didn't know how to do this my logic is like this
So I will put the function between the 1. and 2. before it process the CSV file into TSV File I want the content of the CSV is already sorted.
So my script is looking like this
my $download_dir_link ="C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");
#CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while($complete_download_flag == 0)
{
my #download_directory = read_dir($download_dir_link);
foreach my $downloaded_file (#download_directory)
{
if($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
{
$complete_download_flag = 1;
}
}
sleep(5);
}
#SORTED CONTENTS OF CSV BEFORE CONVERSION
print "sORTING csv content...\n";
#CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new ({ binary => 1 });
my $tsv = Text::CSV->new ({ binary => 1, sep_char => "\t", eol => "\n"});
open my $infh, "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline ($infh))
{
$tsv->print ($outfh, $row);
}
close($infh);
close($outfh);
my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
$tsv_content = $tsv_content.$_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";
#CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if(exists $existing_index_hash{$doc_uid."_pind.html"})
{
if($existing_index_hash{$doc_uid."_pind.html"} ne $tsv_content)
{
$changed_flag = "1";
$updated_files = $updated_files."-$classification\n";
print "Updated: $classification\n";
Hope someone here can help me on this thank you
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = 'Text::CSV_XS'->new({binary => 1, auto_diag => 1});
open my $in, '<', shift or die $!;
my #rows;
while (my $row = $csv->getline($in)) {
push #rows, $row;
}
# Here the sorting happens. Compare the first column,
# if the values are the same, compare the second column.
#rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } #rows;
$csv->say(*STDOUT, $_) for #rows;
You can use the following to sort by all columns (but it compares the values as strings, it doesn't work for numbers):
sub by_all {
my ($n, $A, $B) = #_;
$A->[$n] cmp $B->[$n]
|| $n < $#$A && by_all($n + 1, $A, $B)
}
sort { by_all(0, $a, $b) } #rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw{ looks_like_number };
sub by_all {
my ($n, $A, $B) = #_;
(looks_like_number($A->[$n])
? $A->[$n] <=> $B->[$n]
: $A->[$n] cmp $B->[$n]
) || $n < $#$A && by_all($n + 1, $A, $B)
}
now i have got a problem with preg_match.
This is an example string: "!asksheet!H69=var8949", there can also be more than one var8949 or H69 index in this row. Result shoud be "var33333=var8949"
This is my part:
preg_match_all('#\b\!(.*)\![A-Z]{1,3}\d+\b#', $output, $matches2);
foreach ($matches2[0] as $match2) {
$result6 = $db->query("SELECT varid FROM variablen WHERE varimportedindex = '".$match2."' AND projectid = $pid AND varsheetname LIKE '%".$match2."%' ");
$rowoperation2 = $result6->fetch_assoc();
if ($rowoperation2['varid'] != "" AND $rowoperation2['varid'] != "0") {
$output2 = preg_replace("#\b\!(.*)\![A-Z]{1,3}\d+\b#", "var".$rowoperation2['varid']."", $output);
}
}
Can someone perhaps help?
Thank you,
Regards
Olaf
Why not using a simple preg_match instead of preg_match_all, you don't need word boundary and exclamation mark doesn't need to be escaped, the strings you're looking for are in group 1 and 2:
$str = '"!asksheet!H69=var8949"';
preg_match('#!(.*?)!([A-Z]{1,3}\d+)#', $str, $m);
print_r($m);
Output:
Array
(
[0] => !asksheet!H69
[1] => asksheet
[2] => H69
)
I have a problem that I can not explain myself. I think it might be an error of PHP or Laravel - or I am doing something I should not.
I have a .csv file with country codes and country names
CSV file
I want to read the file and seed my database with the data.
Therefore I have the following code to get the CSV into an array:
$csv = public_path() . "/assets/countries.csv";
if(File::exists($csv)) {
$content = File::get($csv);
$lines = array();
$lines = explode("\n", $content);
for($i=0; $i<sizeof($lines);$i++) {
$line = $lines[$i];
$line = explode(",",$line);
$lines[$i] = $line;
}
}
So far so good, my $lines array has now all the values with each being an array with 2 indexes, 0 for the code and 1 for the country name.
Doing foreach($lines as $line) and var dumping $line, I get:
array(2) { [0]=> string(2) "AF" [1]=> string(12) "Afghanistan " }
.....
For each entry. But now the following happens:
echo $line[0]; // output: AF
echo $line[1]; // undefined offset 1 error
I tried to check if 1 is a string index or what so ever, see the following code+output:
foreach($lines as $line) {
var_dump($line);
echo 'array_key_exists(1, $line): ';
var_dump(array_key_exists(1, $line));
foreach($line as $key => $col) {
echo 'var_dump($key): ';
var_dump($key);
echo '$col: ' . $col;
echo '$line[$key]: ' . $line[$key];
}
}
(I deleted some echoed breaks for readability)
The code produces the following output (for the first result and similar for all others:)
array(2) { [0]=> string(2) "AF" [1]=> string(12) "Afghanistan " }
array_key_exists(1, $line): bool(true)
var_dump($key): int(0)
$col: AF
$line[$key]: AFvar_dump($key): int(1)
$col: Afghanistan
$line[$key]: Afghanistan
How can array_key_exists(1, $line) result in true, but $line[1] in an undefined offset: 1 error? Thanks for your help.
EDIT: $line[$key] is working, while $line[1] is not. var_dump(1 == $key) results in bool(true) in that case...
EDIT2: If I have the same code without using laravel (file_get_contents then) - I do not get an error. See this fiddle for the code
I was stupid. Thanks to the guys from laravel.io I realized that the error is neither within PHP nor Laravel but within my data.
There was an empty last row in .csv file. That file couldn't be exploded, resulting in only one single value in $lines[249] (last entry).
My excuses for this.
As an example:
I load in the input from a .txt:
Benjamin,Schuvlein,Germany,1912,M,White
I do some code that I will not post here for brevity and get to the link:
https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ
I want to scrape multiple things from that page. In the code below, I only do 1.
I'd also like to make each item be separated by a , in the output .txt.
And, I'd like the output to be preceded by the input.
I'm using the following packages in the code:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
Here's the relevant code:
my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
my $dom_tree = new HTML::DOM;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;
Desired Output (from that link) is something like:
Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....
(How much of that output section is scrapable?)
Any help on how to get the link within the link would be much appreciated!
This is fairly simply done using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.
I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the header of programmatic access.
X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
Am I to expect an email from you? I replied to your first mail but haven't heard since.
use strict;
use warnings;
use URI;
use LWP;
use HTML::TreeBuilder::XPath;
my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my #results = $tree->findnodes('//table[#class="result-data"]//tr[#class="result-item"]');
my %data;
for my $item (#results) {
my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
$key =~ s/:$//;
$data{$key} = $val;
}
my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
'event place', 'marital status', 'residence in 1935',
'relationship to head of household (standardized)';
print $record, "\n";
output
"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
Try this
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
$ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
$req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
$res = $ua->request($req);
$content = $res->content;
#$content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
$te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# $te = HTML::TableExtract->new( );
$te->parse($content);
$table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n"; exit;
Which prints out
event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
I also noticed this header:
X-Copyright:COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.
See also http://metacpan.org/pod/HTML::Element#SYNOPSIS
I thought I had answered your question.
The problem is that you are trying to fetch the webpage with LWP. Why are try to doing that if you already have WWW::Mechanize::Firefox?
Did you tried this?
It will retrieve and save each link for further analyses. A small change and you 'get' the DOM tree. Sorry, I do not have acccess to this page, so I just hope it will work.
my $i=1;
for my $link (#links) {
print Dumper $link->url;
print Dumper $link->text;
my $tempfile = './$i.html';$i++;
$mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
my $dom_tree = $mech->document();
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}
EDIT:
Process the page content with regexp (Everyone: Please remember, there is always more than one way to do something wwith Perl!. It works, it is easy...)
it tried it out with this cmd:
wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html|cat 1.html|1.pl
use Data::Dumper;
use strict;
use warnings;
local $/=undef;
my $html = <>;#read from file
#$html = $mech->content( format => 'html' );# read data from mech object
my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is){ # process each TD
my $td = $1;
print "td: $td\n";
my $td_val = $td;
$td_val =~ s!<[^>]*>!!gis;
$td_val =~ s!\s+! !gs;
$td_val =~ s!(\A\s+|\s+\z)!!gs;
if ($td =~ m!result-label!){ #primitive state machine, store the current label
print "current_label: $current_label\n";
$current_label = $td_val;
} elsif ($td =~ m!result-value!){ #add each data to current label
push(#{$data->{$current_label}},$td_val);
} else {
warn "found something else: $td\n";
}
}
#process it using a white lists of known entries (son,race, etc).Delete from the result if you find it on white list, die if you find something new.
#multi type
foreach my $type (qw(son wife daughter head)){
process_multi($type,$data->{$type});
delete($data->{$type});
}
#simple type
foreach my $type (qw(birthplace age)){
process_simple($type,$data->{$type});
delete($data->{$type});
}
die "Unknown label!".Dumper($data) if scalar(keys %{$data})>0;
Output:
'line number:' => [
'28'
],
'estimated birth year:' => [
'1912'
],
'head' => [
'Benjamin Schuvlein',
'M',
'28',
'Germany'
],