Spider a website and retrieve all links that contain a keyword

Spider a website and retrieve all links that contain a keyword - bash

How do I make a Bash script that will copy all links (non-download website). The function is only to get all the links and then save it in a txt file.
I've tried this code:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
Example: there are download links within a website (for example, dlink.com), so I just want to copy all words that contain dlink.com and save it into a txt file.
I've searched around using Google, and I found none of it useful.

Using a proper parser in Perl:
#!/usr/bin/env perl -w
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
my $ua = LWP::UserAgent->new;
my ($url, $f, $p, $res);
if(#ARGV) {
$url = $ARGV[0]; }
else {
print "Enter an URL : ";
$url = <>;
chomp($url);
}
my #array = ();
sub callback {
my($tag, %attr) = #_;
return if $tag ne 'a'; # we only look closer at <a href ...>
push(#array, values %attr) if $attr{href} =~ /dlink\.com/i;
}
# Make the parser. Unfortunately, we don’t know the base yet
# (it might be diffent from $url)
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
sub {$p->parse($_[0])});
# Expand all URLs to absolute ones
my $base = $res->base;
#array = map { $_ = url($_, $base)->abs; } #array;
# Print them out
print join("\n", #array), "\n";

Related

Executing Perl script from windows-command line with 2 entry

this is my Perl script
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my #matches = grep { /^$s$/ } #{$signalsRef};
print "line $i : not found - $s\n" if scalar #matches ==0;
print "line $i : multiple $s\n" if scalar #matches > 1;
}
$i = $i + 1 # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my #virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \#virtualpath);
The script works well, the aim of it is to compare the file "SignalModel.xml" and "CAN_ESP_002_mapping.pm" and putting the lines that didn't matches in a .TXT file. Here is how the .TXT file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But for this script, I put the two files that need to be compare inside of the code and instead of doing that, I would like to run the script in windows cmd line and having something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are in .zip file so if I can execute the script that he goes inside and take the 2 files that I need for comparison, it should be perfect.
I really don't know how to do and what to put inside my script to make that in the cmd windows. Thanks for your help !

Program (or script) parameters are stored in the #ARGV array. shift and pop without any parameter will work on #ARGV when used outside of a sub, in a sub they operate on #_.
See Archive::Zip for zip file handling.

Can't locate WWW/Mechanize.pm: Using cpanm doesn't work

I know this is a duplicate, but my question was not answered in any other threads. the output of sudo cpanm WWW::Mechanize is to long to put in tread. pastebin: 3BYUtSss
I tried executing a perl script, and I get this error:
Can't locate WWW/Mechanize.pm in #INC (#INC contains: /opt/local/lib/perl5/site_perl/5.16.3/darwin-thread-multi-2level /opt/local/lib/perl5/site_perl/5.16.3 /opt/local/lib/perl5/vendor_perl/5.16.3/darwin-thread-multi-2level /opt/local/lib/perl5/vendor_perl/5.16.3 /opt/local/lib/perl5/5.16.3/darwin-thread-multi-2level /opt/local/lib/perl5/5.16.3 /opt/local/lib/perl5/site_perl /opt/local/lib/perl5/vendor_perl .) at io.pl line 5.
In case you need it, here is my perl script's contents:
#!/usr/bin/env perl
use warnings;
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my ($get,$host,$title);
while (<>) {
if (m|^GET (\S+) |) {
$get = $1;
} elsif ( m|^Host: (\S+)\.| ) {
$host = $1;
} else {
# Unrecognized line...reset
$get = $host = $title = '';
}
if ($get and $host) {
my ($title) = $get =~ m|^.*\/(.+?)$|; # default title
my $url = 'http://' . $host . $get;
$mech->get($url);
if ($mech->success) {
# HTML may have title, images will not
$title = $mech->title() || $title;
}
print "Title: $title\n";
print "URL: $url\n";
print "\n";
$get = $host = $title = '';
}
}

These look to be the key lines in the output from cpanm, down at the bottom.
! Installing the dependencies failed: Installed version (3.59) of CGI is not in range '4.08'
! Bailing out the installation for WWW-Mechanize-1.75.
Looks like you need to install a higher version of the CGI distribution.

The key lines in the cpanm output are:
Building and testing CGI-4.21 ... FAIL
! Installing CGI failed. See /Users/skylerspaeth/.cpanm/work/1440436409.90704/build.log for details. Retry with --force to force install it.
So look in /Users/skylerspaeth/.cpanm/work/1440436409.90704/build.log and see what the problem is. If that log is no longer there, you may need to run cpanm again, which will generate another build.log.
You find the key lines in cpanm output by searching for "fail". Usually, it'll point you at a build.log file for further details.

Need to split multiple files in a directory based on string, rename properly using powershell or fix my perl script

I have a directory full of files (text exports of Dynamics NAV objects that have been exported) in Windows. Each file contains multiple objects. I need to split each file into separate files based on lines that begin with OBJECT, and name each file appropriately.
The purpose of this is to get our Dynamics NAV system into git.
I wrote a nifty perl program to do this that works great on linux. But it hangs on the while(<>) loop in Windows (Server 2012 if that matters).
So, I need to either figure out how to do this in the PowerShell script that I wrote that generates all of the files, or fix my perl script that I'm calling from PowerShell. Does Windows perl handle filehandles differently than linux?
Here's my code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path remove_tree);
use POSIX qw(strftime);
my $username = getlogin || getpwuid($<);
my $datestamp = strftime("%Y%m%d-%H%M%S", localtime);
my $work_dir = "/temp/nav_export";
my $objects_dir = "$work_dir/$username/objects";
my $export_dir = "$work_dir/$username/$datestamp";
print "Objects being exported to $export_dir\n";
make_path("$export_dir/Page", "$export_dir/Codeunit", "$export_dir/MenuSuite", "$export_dir/Query", "$export_dir/Report", "$export_dir/Table", "$export_dir/XMLport");
chdir $objects_dir or die "Could not change to $objects_dir: $!";
# delete empty files
foreach(glob('*.*')) {
unlink if -f and !-s _;
}
my #files = <*>;
my $count = #files;
print "Processing $count files\n";
open (my $fh, ">-") or die "Could not open standard out: $!";
# OBJECT Codeunit 1 ApplicationManagement
while(<>)
{
if (m/^OBJECT ([A-Za-z]+) ([0-9]+) (.*)/o)
{
my $objectType = $1;
my $objectID = $2;
my $objectName = my $firstLine = $3;
$objectName =~ s/[\. \/\(\)\\]/_/g; # translate spaces, (, ), ., \ and / to underscores
$objectName =~ tr/\cM//d; # get rid of Ctrl-M
my $filename = $export_dir . "/" . $objectType . "/" . $objectType . "~" . $objectID . "~" . $objectName;
close $fh and open($fh, '>', $filename) or die "Could not open file '$filename' $!";
print $fh "OBJECT $objectType $objectID $firstLine\n";
next;
}
print $fh $_;
}
I've learned quite a bit of PowerShell in the past few days. There are some things that it really does quite well. And some (such as calling an executable with variables and command line options that have spaces) that are maddeningly difficult to figure out. To call curl, this is what I resorted to:
$curl = "C:\Program Files (x86)\cURL\bin\curl"
$arg10 = '-s'
$arg1 = '-X'
$arg11 = 'post'
$arg2 = '-H'
$arg22 = '"Accept-Encoding: gzip,deflate"'
$arg3 = '-H'
$arg33 = '"Content-Type: text/xml;charset=UTF-8"'
$arg4 = '-H'
$arg44 = '"SOAPAction:urn:microsoft-dynamics-schemas/page/permissionrange:ReadMultiple"'
$arg5 = '--ntlm'
$arg6 = '-u'
$arg66 = 'username:password'
$arg7 = '-d'
$arg77 = '"#soap_envelope.txt"'
$arg8 = "http://$servicetier.corp.company.net:7047/$database/WS/DBDOC/Page/PermissionRange"
$arg9 = "-o"
$arg99 = "c:\temp\nav_export\$env:username\raw_list.xml"
&"$curl" $arg10 $arg1 $arg11 $arg2 $arg22 $arg3 $arg33 $arg4 $arg44 $arg5 $arg6 $arg66 $arg7 $arg77 $arg8 $arg9 $arg99
I realize that part is a bit of a tangent. But I've been working really hard at trying to figure this out and not have to bother you nice folk here at stackoverflow!
I'm ambivalent about making it work in PowerShell or fixing the Perl code at this point. I just need to make it work. But I'm hoping it's just some little difference in filehandle handling between linux and Windows.

It's hard to believe that the Perl code that you show does anything on Linux either. It looks like your while loop is supposed to be reading through all of the files in the #files array, but to make it do that you have to copy the names to #ARGV.
Also note that #files will contain directories as well as files.
I suggest you change the lines starting with my #files = <*> to this. There's no reason why it shouldn't work on both Windows and Linux.
our #ARGV = grep -f, glob '*';
my $count = #ARGV;
print "Processing $count files\n";
my $fh;
while (<>) {
s/\s+\z//; # Remove trailing whitespace (including CR and LF)
my #fields = split ' ', $_, 4;
if ( #fields == 4 and $fields[0] eq 'OBJECT' ) {
my ($object_type, $object_id, $object_name) = #fields[1,2,3];
$object_name =~ tr{ ().\\/}{_}; # translate spaces, (, ), ., \ and / to underscores
my $filename = "$export_dir/$object_type/$object_type~$object_id~$object_name";
open $fh, '>', $filename or die "Could not open file '$filename': $!";
}
print $fh "$_\n" if $fh;
if (eof) {
close $fh;
$fh = undef;
}
}

How can I check if stdin exists in PHP ( php-cgi )?

Setup and Background
I am working on script that needs to run as /usr/bin/php-cgi instead /usr/local/bin/php and I'm having trouble checking for stdin
If I use /usr/local/bin/php as the interpreter I can do something like
if defined('STDIN'){ ... }
This doesn't seem to work with php-cgi - Looks to always be undefined. I checked the man page for php-cgi but didn't find it very helpful. Also, if I understand it correctly, the STDIN constant is a file handle for php://stdin. I read somewhere that constant is not supposed to be available in php-cgi
Requirements
The shebang needs to be #!/usr/bin/php-cgi -q
The script will sometimes be passed arguments
The script will sometimes receive input via STDIN
Current Script
#!/usr/bin/php-cgi -q
<?php
$stdin = '';
$fh = fopen('php://stdin', 'r');
if($fh)
{
while ($line = fgets( $fh )) {
$stdin .= $line;
}
fclose($fh);
}
echo $stdin;
Problematic Behavior
This works OK:
$ echo hello | ./myscript.php
hello
This just hangs:
./myscript.php
These things don't work for me:
Checking defined('STDIN') // always returns false
Looking to see if CONTENT_LENGTH is defined
Checking variables and constants
I have added this to the script and run it both ways:
print_r(get_defined_constants());
print_r($GLOBALS);
print_r($_COOKIE);
print_r($_ENV);
print_r($_FILES);
print_r($_GET);
print_r($_POST);
print_r($_REQUEST);
print_r($_SERVER);
echo shell_exec('printenv');
I then diff'ed the output and it is the same.
I don't know any other way to check for / get stdin via php-cgi without locking up the script if it does not exist.
/usr/bin/php-cgi -v yields: PHP 5.4.17 (cgi-fcgi)

You can use the select function such as:
$stdin = '';
$fh = fopen('php://stdin', 'r');
$read = array($fh);
$write = NULL;
$except = NULL;
if ( stream_select( $read, $write, $except, 0 ) === 1 ) {
while ($line = fgets( $fh )) {
$stdin .= $line;
}
}
fclose($fh);

Regarding your specific problem of hanging when there is no input: php stream reads are blocking operations by default. You can change that behavior with stream_set_blocking(). Like so:
$fh = fopen('php://stdin', 'r');
stream_set_blocking($fh, false);
$stdin = fgets($fh);
echo "stdin: '$stdin'"; // immediately returns "stdin: ''"
Note that this solution does not work with that magic file handle STDIN.

stream_get_meta_data helped me :)
And as mentioned in the previous answer by Seth Battin stream_set_blocking($fh, false); works very well 👍
The next code reads data from the command line if provided and skips when it's not.
For example:
echo "x" | php render.php
and php render.php
In the first case, I provide some data from another stream (I really need to see the changed files from git, something like git status | php render.php.
Here is an example of my solution which works:
$input = [];
$fp = fopen('php://stdin', 'r+');
$info = stream_get_meta_data($fp);
if (!$info['seekable'] && $fp) {
while (false !== ($line = fgets($fp))) {
$input[] = trim($line);
}
fclose($fp);
}

The problem is that you create a endless loop with the while($line = fgets($fh)) part in your code.
$stdin = '';
$fh = fopen('php://stdin','r');
if($fh) {
// read *one* line from stdin upto "\r\n"
$stdin = fgets($fh);
fclose($fh);
}
echo $stdin;
The above would work if you're passing arguments like echo foo=bar | ./myscript.php and will read a single line when you call it like ./myscript.php
If you like to read more lines and keep your original code you can send a quit signal CTRL + D
To get parameters passed like ./myscript.php foo=bar you could check the contents of the $argv variable, in which the first argument always is the name of the executing script:
./myscript.php foo=bar
// File: myscript.php
$stdin = '';
for($i = 1; $i < count($argv); i++) {
$stdin .= $argv[$i];
}
I'm not sure that this solves anything but perhaps it give you some ideas.

Perl - Using Variables from an Input File in the URL when a Variable has a Space (two words)

What am I doing? The script loads a string from a .txt (locations.txt), and separates it into 6 variables. Each variable is separated by a comma. Then I go to a website, whose address depends on these 6 values.
What is the problem? If there is a space as a character in a variable as part of a string in locations.txt. When there is a space, it does not get the correct url.
The input file is:
locations.txt = Heinz,Weber,Sierra Leone,1915,M,White
Because Sierra Leone has a space, the url is:
https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3AHeinz%20%2Bsurname%3AWeber%20%2Bbirth_place%3A%22Sierra%20Leone%22%20%2Bbirth_year%3A1914-1918~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
But that does not get processed correctly in the code below.
I'm using the packages:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TableExtract;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
This is the beginning of the code :
open(my $l, 'locations26.txt') or die "Can't open locations: $!";
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
while (my $line = <$l>) {
chomp $line;
my %args;
#args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;
$args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);
my $mech = WWW::Mechanize::Firefox->new(create => 1, activate => 1);
$mech->get("https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219");
# REST OF THE SCRIPT HERE. MANY LINES.
}
As another example, the following would work:
locations.txt = Benjamin,Schuvlein,Germany,1913,M,White

I have not used Mechanize, so not sure whether you need to encode the URL. Try encoding space to %20 or + before running $mech->get
$url =~ s/ /+/g;
Or
$url =~ s/ /%20/g
whichever works :)
====
Edit:
my $url = "https://familysearch.org/search/collection/results#count=20& query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219";
$url =~ s/ /+/g;
$mech->get($url);
Try that.

If you have the error
Global symbol "$url" requires explicit package name.
this means that you forgot to declare $url with :
my $url;
Your use part seems freaky, I'm pretty sure that you don't need all of those modules # the same time. If you use WWW::Mechanize, no need LWP::UserAgent and CGI I guess...

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Spider a website and retrieve all links that contain a keyword - bash

Related

Executing Perl script from windows-command line with 2 entry

Can't locate WWW/Mechanize.pm: Using cpanm doesn't work

Need to split multiple files in a directory based on string, rename properly using powershell or fix my perl script

How can I check if stdin exists in PHP ( php-cgi )?

Perl - Using Variables from an Input File in the URL when a Variable has a Space (two words)

Categories

Resources