Downloading pdf files with wget. (characters after file extension?) - shell

I'm trying to recursively download all .pdf files from a webpage.
The file URLs have this format:
"http://example.com/fileexample.pdf?id=****"
I'm using these parameters:
wget -r -l1 -A.pdf http://example.com
wget is rejecting all the files when saving them. I get this error when using --debug:
Removing file due to recursive rejection criteria in recursive_retrieve()
I think that's happening because of this "?id=****" after the extension.

But did you try -A "*.pdf*"? According to the wget docs, this should work.
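A minimal sketch of the full command with that pattern (example.com stands in for the real host, as in the question):
wget -r -l1 -A "*.pdf*" http://example.com
If your wget supports --accept-regex, something like wget -r -l1 --accept-regex '.*\.pdf.*' http://example.com is another way to keep URLs that carry a query string after the .pdf extension.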

Related

wget: using wildcards in the middle of the path

I am trying to recursively download .nc files from: https://satdat.ngdc.noaa.gov/sem/goes/data/full/*/*/*/netcdf/*.nc
A target link looks like this one:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/netcdf/
and I need to exclude this:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/csv/
I do not understand how to use wildcards to define a path in wget.
Also, the following command (a test for year 1981 only) only downloads subfolders 10, 11 and 12, failing for the {01..09} subfolders:
for i in {01..12};do wget -r -nH -np -x --force-directories -e robots=off https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/${i}/goes02/netcdf/; done
"I do not understand how to use wildcards to define a path in wget."
According to the GNU Wget manual,
"File name wildcard matching and recursive mirroring of directories are available when retrieving via FTP."
so you cannot use wildcards in the URL when working with an HTTP or HTTPS server.
You can combine -r with --accept-regex urlregex to
"Specify a regular expression to accept (...) the complete URL."
Observe that it should match the whole URL. For example, if I want the pages linked in the GNU Package blurbs whose path contains auto, I could do that with
wget -r --level=1 --accept-regex '.*auto.*' https://www.gnu.org/manual/blurbs.html
which results in downloading the main pages of autoconf, autoconf-archive, autogen and automake. Note: --level=1 is used to prevent going further down than the links shown in the blurbs page.
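Applied to the original question, a sketch that inverts the logic and rejects the csv/ branch instead (assuming the year/month/satellite index pages are plain directory listings wget can crawl; the 1981 start point mirrors the test above):
wget -r -np -nH -e robots=off --reject-regex '.*/csv/.*' https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/
Rejecting tends to be simpler here, because an accept regex would also have to match the intermediate directory listings that wget has to fetch while descending.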

WGET saves with wrong file and extension name possibly due to BASH

I've tried this on a few forum threads already.
However, I keep getting the same failure as a result.
To replicate the problem:
Here is a URL leading to a forum thread with 6 pages.
http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1
What I typed into the console was:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"
And here is what I got:
--2018-06-14 10:44:17-- http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '1'
1 [ <=> ] 19.50K 58.7KB/s in 0.3s
2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]
The file was saved simply as "1", with no extension it seems.
My expectation was that the file would be saved with an .html extension, because it's a webpage.
I'm trying to get wget to work, but if it's possible to do what I want with curl then I would also accept that as an answer.
Well, there are a couple of issues with what you're trying to do.
The double quotes around your URL actually prevent Bash brace expansion, so you're not really downloading 6 files, but a single URL with "{1..6}" in it. You probably want to drop the quotes around the URL to allow bash to expand it into 6 different arguments.
I notice that all of the pages are called "1", irrespective of their actual page numbers. This means the server is always serving a page with the same name, making it very hard for Wget or any other tool to actually make a copy of the webpage.
The real way to create a mirror of the forum would be to use this command line:
$ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1
Let me explain what this command does:
-m --mirror activates the mirror mode (recursion)
--no-parent asks Wget to not go above the directory it starts from
-k --convert-links will edit the HTML pages you download so that the links in them will point to the other local pages you have also downloaded. This allows you to browse the forum pages locally without needing to be online
--adjust-extension This is the option you were originally looking for. It will cause Wget to save the file with a .html extension if it downloads a text/html file but the server did not provide an extension.
Simply use the -O switch to specify the output filename; otherwise wget just defaults to something derived from the URL, which in your case is 1.
So if you wanted to call your file what-i-want-to-call-it.html, you would do
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1" -O what-i-want-to-call-it.html
If you type wget --help into the console, you will get a full list of all the options that wget provides.
To verify it has worked, type the following to output the file:
cat what-i-want-to-call-it.html
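Putting both answers together, a sketch that fetches all 6 pages of the thread by letting the shell expand {1..6} outside the quotes and naming each output file explicitly (page-$p.html is just an illustrative name, not from the original):
for p in {1..6}; do wget -O "page-$p.html" "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/$p/vc/1"; done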

How to parse html with wget to download an artifact using pattern matching against Jenkins

I am trying to download an artifact from Jenkins where I need the latest build. If I curl jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build it brings me to the page that contains the artifact I need to download, which in my case is myCompany-1234.ipa
By changing curl to wget with --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ it downloads the index.html file.
If I add --reject index.html, it stops index.html from being downloaded.
If I add the name of the artifact with a wildcard, like MyCompany-*.ipa, it downloads a 14k file named MyCompany-*.ipa, not the MyCompany-1234.ipa I was hoping for. Keep in mind the page I am requesting only has one MyCompany-1234.ipa, so there will never be multiple matches.
If I use a flag to pattern match, -A "*.ipa", like so: wget --auth-no-challenge -A "*.ipa" https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ it still doesn't download the artifact.
It works if I enter the exact URL, like so: wget --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/MyCompany-1234.ipa
The problem here is that the .ipa is not always going to be 1234; tomorrow it will be 1235, and so on. How can I either parse the HTML or use the wildcard correctly in wget to ensure I am always getting the latest?
Never mind; working with another engineer here at work, we came up with a super elegant solution parsing JSON.
Install Chrome and get the plugin JSONView
Call the Jenkins API in your Chrome browser using https://$domain/$job/lastSuccessfulBuild/api/json
This will print out the key-value pairs in JSON. Note your key; for me it was number.
brew install jq
In a bash script, create a variable that will store the dynamic value, as follows.
This will store the build number in latest:
latest=$(curl --silent --show-error https://userIsMe:123myapikey321#jenkins.mycompany.com/job/build_ios/lastSuccessfulBuild/api/json | jq '.number')
Print it to screen if you would like:
echo $latest
Now with some string interpolation pass the variable for latest to your wget call:
wget --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa
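Put together as one small script (a sketch that simply chains the two commands above; the host, job path and userIsMe:123myapikey321 credentials are the same placeholders used in this answer):
#!/usr/bin/env bash
set -euo pipefail
base="https://userIsMe:123myapikey321#jenkins.mycompany.com"
latest=$(curl --silent --show-error "$base/job/build_ios/lastSuccessfulBuild/api/json" | jq '.number')
echo "Latest successful build: $latest"
wget --auth-no-challenge "$base/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa"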
Hope this helps someone out as there is limited info out there that is clear and concise, especially given that wget has been around for an eternity.

Downloading Only Newest File Using Wget / Curl

How would I use wget or curl to download the newest file in a directory?
This seems really easy; however, the filename won't always be predictable, and as new data comes in, it'll be replaced by a file with a new random name.
Specifically, the directory I wish to download data from has the following naming structure, where the last string of characters is a randomly generated timestamp:
MRMS_RotationTrackML1440min_00.50_20160530-175837.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-182639.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz
The randomly generated timestamp is in the format of: {hour}{minute}{second}
The directory in question is here: http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
Could it have something to do with the headers, where you'd use curl to sift through the Last-Modified timestamp?
Any help would be appreciated here, thanks in advance.
You can just run the following command periodically:
wget -r -nc --level=1 http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
It will recursively download whatever is new in the directory since the last run.
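If you really only want the single newest file, a hedged sketch (assuming the page is a plain directory index and that the YYYYMMDD-HHMMSS stamp in the name sorts chronologically, as it does for the examples above):
newest=$(curl -s http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/ | grep -o 'MRMS_RotationTrackML1440min_[^"]*\.grib2\.gz' | sort | tail -n 1)
wget "http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/$newest"
The grep simply pulls the file names out of the index HTML; sort works because the timestamp portion of the name is lexicographically ordered.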

How to combine url with filename from file

Text file (filename: listing.txt) with names of files as its contents:
ace.pdf
123.pdf
hello.pdf
I wanted to download these files from the URL http://www.myurl.com/
In bash, I tried to merge these together and download the files using wget, e.g.:
http://www.myurl.com/ace.pdf
http://www.myurl.com/123.pdf
http://www.myurl.com/hello.pdf
I tried variations of the following, but without success:
for i in $(cat listing.txt); do wget http://www.myurl.com/$i; done
No need to use cat and a loop. You can use xargs for this:
xargs -I {} wget http://www.myurl.com/{} < listing.txt
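If the list is long, the same approach can run downloads in parallel (a sketch; -P 4 is an arbitrary choice of four concurrent wget processes):
xargs -P 4 -I {} wget http://www.myurl.com/{} < listing.txt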
Actually, wget has options which can avoid loops & external programs completely.
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)
If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.
However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by specifying --base=url on the command line.
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
-B URL
--base=URL
Resolves relative links using URL as the point of reference, when reading links from an HTML file specified via the -i/--input-file option (together with --force-html, or when the input file was fetched remotely from a server describing it as HTML). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value for the "href" attribute.
For instance, if you specify http://foo/bar/a.html for URL, and Wget reads ../baz/b.html from the input file, it would be resolved to http://foo/baz/b.html.
Thus,
$ cat listing.txt
ace.pdf
123.pdf
hello.pdf
$ wget -B http://www.myurl.com/ -i listing.txt
This will download all three files.
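An equivalent without -B is to build the full URLs yourself and feed them to wget on standard input, which the quoted -i documentation allows via - as the file name:
sed 's|^|http://www.myurl.com/|' listing.txt | wget -i -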
