Getting a list of URLs with wget using regex - bash

I'm starting with the page:
https://mysite/a
I'd like to spider the page, getting the full URLs of any nested pages below it that begin with the same stem (like https://mysite/a/b).
I've tried:
$ wget -r --spider --accept-regex "https://...*" 'https://.../' 2>test.txt
which produces a large amount of output, including what appear to be the URLs I'm after, like:
--2018-04-21 15:04:48-- https://mysite/a/
Reusing existing connection to mysite:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'a/index.html.tmp.tmp'
How do I just print out a list of the URLs?
Edit:
I changed it to:
$ wget -r --spider 'https://mysite/a/' |grep 'https://mysite/a*' 2>test.txt
as a test. No output is being saved in test.txt; the file is empty.
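A likely cause: wget writes its progress log to stderr, not stdout, so the pipe hands grep nothing, and 2>test.txt only captures grep's own (empty) stderr. A minimal sketch of a fix, assuming every URL of interest shares the https://mysite/a stem, redirects stderr onto stdout before the pipe and extracts the matches:
$ wget -r --spider 'https://mysite/a/' 2>&1 | grep -o 'https://mysite/a[^ ]*' | sort -u > test.txt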

Related

How to stream into wget?

tac FILE | sed -n -e 's/^.*URL: //p' | SEND TO WGET HERE
This one liner above gives a list of URLs from a file, one per line. I am trying to stream/pipe these into wget directly. Each URL is a thumbnail picture that I need to mass-download, and I'm trying to write this one-liner to facilitate the process.
This one liner above gives a list of URLs from a file, one per line. I am trying to (...) pipe these into wget directly.
In order to do so you might harness the -i file option: if you give - as the file, wget will read URLs from standard input. From the wget man page:
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input (...) If this function is used, no URLs need be present on the command line (...)
So in your case:
command | wget -i -
where command is a command whose output is one URL per line.
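Applied to the pipeline from the question, that would look something like this (a sketch reusing the same FILE and sed expression as above):
tac FILE | sed -n -e 's/^.*URL: //p' | wget -i -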
Use xargs to set the arguments of a command from standard input:
tac FILE | sed -n -e 's/^.*URL: //p' | xargs wget
Here, each word of xargs's standard input is passed as a positional argument to wget.
Demo:
$ cat FILE
URL: https://google.com https://netflix.com
asdfdas URL: https://stackoverflow.com
$ tac FILE | sed -n -e 's/^.*URL: //p' | xargs wget
--2021-12-30 12:53:17-- https://stackoverflow.com/
Resolving stackoverflow.com (stackoverflow.com)... 151.101.65.69, 151.101.193.69, 151.101.129.69, ...
Connecting to stackoverflow.com (stackoverflow.com)|151.101.65.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.7’
index.html.7 [ <=> ] 175,76K 427KB/s in 0,4s
2021-12-30 12:53:18 (427 KB/s) - ‘index.html.7’ saved [179983]
--2021-12-30 12:53:18-- https://google.com/
Resolving google.com (google.com)... 142.250.184.142, 2a00:1450:4017:80c::200e
Connecting to google.com (google.com)|142.250.184.142|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.google.com/ [following]
--2021-12-30 12:53:18-- https://www.google.com/
Resolving www.google.com (www.google.com)... 142.250.187.100, 2a00:1450:4017:807::2004
Connecting to www.google.com (www.google.com)|142.250.187.100|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://consent.google.com/ml?continue=https://www.google.com/&gl=GR&m=0&pc=shp&hl=el&src=1 [following]
--2021-12-30 12:53:19-- https://consent.google.com/ml?continue=https://www.google.com/&gl=GR&m=0&pc=shp&hl=el&src=1
Resolving consent.google.com (consent.google.com)... 216.58.206.206, 2a00:1450:4017:80c::200e
Connecting to consent.google.com (consent.google.com)|216.58.206.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.8’
index.html.8 [ <=> ] 12,16K --.-KB/s in 0,01s
2021-12-30 12:53:19 (1,25 MB/s) - ‘index.html.8’ saved [12450]
--2021-12-30 12:53:19-- https://netflix.com/
Resolving netflix.com (netflix.com)... 54.155.246.232, 18.200.8.190, 54.73.148.110, ...
Connecting to netflix.com (netflix.com)|54.155.246.232|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.netflix.com/ [following]
--2021-12-30 12:53:19-- https://www.netflix.com/
Resolving www.netflix.com (www.netflix.com)... 54.155.178.5, 3.251.50.149, 54.74.73.31, ...
Connecting to www.netflix.com (www.netflix.com)|54.155.178.5|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.netflix.com/gr-en/ [following]
--2021-12-30 12:53:20-- https://www.netflix.com/gr-en/
Reusing existing connection to www.netflix.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.9’
index.html.9 [ <=> ] 424,83K 1003KB/s in 0,4s
2021-12-30 12:53:21 (1003 KB/s) - ‘index.html.9’ saved [435027]
FINISHED --2021-12-30 12:53:21--
Total wall clock time: 4,1s
Downloaded: 3 files, 613K in 0,8s (725 KB/s)

regular expression inside a cURL call

I have a cURL call like this:
curl --silent --max-filesize 500 --write-out "%{http_code}\t%{url_effective}\n" 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg' -o /dev/null
This call generates a list of URLs with the HTTP code (200 or 404, normally) like this:
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.203.dmg
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.204.dmg
200 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.205.dmg
404 http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.206.dmg
The only valid URLs are the ones preceded by the 200 HTTP code, so I would like to put a regular expression in the cURL call so that it only downloads the URLs whose lines start with 200.
Any ideas on how to do this without resorting to a bash script?
Thank you in advance
You can use the following:
curl --silent -f --max-filesize 500 --write-out "%{http_code}\t%{url_effective}\n" -o '#1.dmg' 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg'
This will try to reach every URL and, when the response is neither a 404 nor too large, download it into a file whose name is based on the index in the URL.
The -f flag prevents curl from outputting the response body when the HTTP code isn't a success code, while the -o flag specifies an output file, where #1 corresponds to the effective value of your [200-210] range (adding other [] or {} groups would let you refer to other parts of the URL by their index).
Note that during my tests, the --max-filesize 500 flag prevented the download of the only URL which didn't end in a 404, fmpa_17.0.2.205.dmg.
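If the goal is only a list of the valid URLs rather than the downloads themselves, one possible variant (a sketch, untested against this server) keeps the write-out on stdout and filters it afterwards with awk:
curl --silent --max-filesize 500 --write-out "%{http_code}\t%{url_effective}\n" -o /dev/null 'http://fmdl.filemaker.com/maint/107-85rel/fmpa_17.0.2.[200-210].dmg' | awk '$1 == 200 {print $2}'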

Parse download speed from wget output in terminal

I have the following command:
sudo wget --output-document=/dev/null http://speedtest.pixelwolf.ch
which outputs:
--2016-03-27 17:15:47-- http://speedtest.pixelwolf.ch/
Resolving speedtest.pixelwolf.ch (speedtest.pixelwolf.ch)... 178.63.18.88, 2a02:418:3102::6
Connecting to speedtest.pixelwolf.ch (speedtest.pixelwolf.ch)|178.63.18.88|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85 [text/html]
Saving to: `/dev/null`
100%[======================>]85 --.-K/s in 0s
2016-03-27 17:15:47 (8.79 MB/s) - `/dev/null` saved [85/85]
I'd like to be able to parse the (8.79 MB/s) from the last line and store this in a file (or any other way I can get this into a local PHP file easily). I tried to store the full output by changing my command to --output-document=/dev/speedtest, but this just saved "Could not reach website" in the file and not the terminal output of the command.
Not quite sure where to start with this, so any help would be awesome.
Not sure if it helps, but my intention is for this stored value (8.79 in this instance) to be read by a PHP file and handled there every 30 seconds, which I'll achieve by: while true; do (run speed test and save speed variable to a file cmd); php handleSpeedTest.php; sleep 5; done, where handleSpeedTest.php will read that stored value and handle it accordingly.
I changed the URL to one that works. Redirected stderr onto stdout. Used grep --only-matching (-o) and a regex.
sudo wget -O /dev/null http://www.google.com 2>&1 | grep -o '\([0-9.]\+ [KM]B/s\)'
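To get just the number into a file for the PHP side, one possible extension (a sketch; speed.txt is an assumed filename) keeps the last speed reported and strips the unit:
sudo wget -O /dev/null http://www.google.com 2>&1 | grep -o '[0-9.]\+ [KM]B/s' | tail -n 1 | cut -d' ' -f1 > speed.txt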

Why is wget saving something when using parametrized url?

I am using following command in my bash script to trigger jenkins build:
wget --no-check-certificate "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx"
Output:
HTTP request sent, awaiting response... 201 Created
Length: 0
Saving to: “buildWithParameters?token=xxx”
[ <=> ] 0 --.-K/s in 0s
2015-02-20 10:10:46 (0.00 B/s) - “buildWithParameters?token=xxx” saved [0/0]
And then it creates an empty file: “buildWithParameters?token=xxx”
My question is: why wget creates this file and how to turn that functionality off?
Most simply:
wget --no-check-certificate -O /dev/null http://foo
This will make wget save the file to /dev/null, effectively discarding it.
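If the log output is unwanted as well (an assumption about the use case), wget's -q flag silences it, making the whole trigger quiet:
wget --no-check-certificate -q -O /dev/null "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx"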

Why does my wget in Bash result in a 400 Bad Request error?

I have this data in text file:
-O BNU-ESM-pr-Historical-19560101-19601231.nc https://dataserver.nccs.nasa.gov/thredds/ncss/bypass/NEX-GDDP/bcsd/historical/r1i1p1/pr/BNU-ESM.ncml?var=pr&north=55&west=72&east=136&south=16&horizStride=1&time_start=1956-01-01T12%3A00%3A00Z&time_end=1960-12-31T12%3A00%3A00Z&timeStride=1
I am using this code for a .sh file:
#!/bin/bash
while read -r line; do wget $line; done < pr_china.txt
Result of the command in BASH:
ahmad@ahmad:/mnt/c/script_sh_files$ ./pr_china.sh
--2018-12-29 23:10:30-- https://dataserver.nccs.nasa.gov/thredds/ncss/bypass/NEX-GDDP/bcsd/historical/r1i1p1/pr/BNU-ESM.ncml?var=pr&north=55&west=72&east=136&south=16&horizStride=1&time_start=1956-01-01T12%3A00%3A00Z&time_end=1960-12-31T12%3A00%3A00Z&timeStride=1%0D
Resolving dataserver.nccs.nasa.gov (dataserver.nccs.nasa.gov)... 2001:4d0:2418:2800::a99a:9229, 169.154.146.41
Connecting to dataserver.nccs.nasa.gov (dataserver.nccs.nasa.gov)|2001:4d0:2418:2800::a99a:9229|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-12-29 23:10:33 ERROR 400: Bad Request.
Run dos2unix on your pr_china.txt file before you use it.
See: How to remove %0D from end of URL when using wget?
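If dos2unix isn't available, a minimal alternative sketch strips the carriage returns (the %0D visible at the end of the logged URL) with tr instead:
#!/bin/bash
tr -d '\r' < pr_china.txt | while read -r line; do wget $line; done
Note that $line is deliberately left unquoted, as in the original script, since each line in the file carries both the -O option and the URL and relies on word splitting.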
