Wget saves with wrong file name and extension, possibly due to Bash

I've tried this on a few forum threads already, but I keep getting the same failure.
To replicate the problem:
Here is a URL leading to a forum thread with 6 pages.
http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1
What I typed into the console was:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"
And here is what I got:
--2018-06-14 10:44:17-- http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '1'
1 [ <=> ] 19.50K 58.7KB/s in 0.3s
2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]
The file was saved simply as "1", with no extension, it seems.
I expected the file to be saved with an .html extension, because it's a webpage.
I'm trying to get wget to work, but if it's possible to do what I want with curl, then I would also accept that as an answer.

Well, there are a couple of issues with what you're trying to do.
The double quotes around your URL prevent Bash brace expansion, so you're not really downloading 6 pages but a single URL with a literal "{1..6}" in it. Remove the quotes around the URL (or quote only the parts that need it) so that Bash can expand it into 6 different arguments.
I also notice that every page gets saved as "1", irrespective of its actual page number: the last component of each URL is always "1" (the /vc/1 part), so Wget names every download the same way, which makes it very hard for Wget or any other tool to make a usable copy of the thread.
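As an illustration (not part of the original answer), echoing the argument with and without quotes shows what the shell actually hands to wget:
echo "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"   # one literal argument, braces intact
echo http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1     # expands to six separate URLs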
The real way to create a mirror of the forum would be to use this command line:
$ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1
Let me explain what this command does:
-m --mirror activates the mirror mode (recursion)
--no-parent asks Wget to not go above the directory it starts from
-k --convert-links will edit the HTML pages you download so that the links in them will point to the other local pages you have also downloaded. This allows you to browse the forum pages locally without needing to be online
--adjust-extension This is the option you were originally looking for. It will cause Wget to save the file with a .html extension if it downloads a text/html file but the server did not provide an extension.

Simply use the -O switch to specify the output filename; otherwise wget derives a default name from the URL, which in your case is 1.
So if you wanted to call your file what-i-want-to-call-it.html, you would do
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1" -O what-i-want-to-call-it.html
If you type wget --help into the console, you will get a full list of the options that wget provides.
To verify it has worked, output the file with
cat what-i-want-to-call-it.html
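If you want all six pages, each under its own name, one possible approach (a sketch only; the fpart-N.html names are made up for this example) is to combine the unquoted brace expansion from the first answer with a per-page -O:
for n in {1..6}; do
  wget -O "fpart-$n.html" "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/$n/vc/1"
done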

Related

wget: using wildcards in the middle of the path

I am trying to recursively download .nc files from: https://satdat.ngdc.noaa.gov/sem/goes/data/full/*/*/*/netcdf/*.nc
A target link looks like this one:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/netcdf/
and I need to exclude this:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/csv/
I do not understand how to use wildcards to define the path in wget.
Also, the following command (a test for year 1981 only) only downloads subfolders 10, 11 and 12, failing on the {01..09} subfolders:
for i in {01..12};do wget -r -nH -np -x --force-directories -e robots=off https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/${i}/goes02/netcdf/; done
I do not understand how to use wildcards to define the path in wget.
According to the GNU Wget manual,
File name wildcard matching and recursive mirroring of directories are available when retrieving via FTP.
so you cannot use wildcards in the URL when working with an HTTP or HTTPS server.
You might combine -r with --accept-regex urlregex to
Specify a regular expression to accept (...) the complete URL.
Observe that it has to match the whole URL. For example, if I wanted the pages linked from the GNU Package blurbs whose path contains auto, I could do that with
wget -r --level=1 --accept-regex '.*auto.*' https://www.gnu.org/manual/blurbs.html
which results in downloading the main pages of autoconf, autoconf-archive, autogen and automake. Note: --level=1 is used to prevent going further down than the links shown in the blurbs.
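Applied to the NOAA layout from the question, an untested sketch along the same lines could accept directory indexes (URLs ending in /) plus .nc files that sit under a netcdf/ directory, so files from the parallel csv/ trees are never saved; whether the recursion walks the indexes cleanly depends on how the server lists them:
wget -r -np -nH -e robots=off \
     --accept-regex '(.*/netcdf/.*\.nc|.*/)' \
     https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/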

Downloading PDF files with wget (characters after the file extension?)

I'm trying to recursively download all .pdf files from a webpage.
The files URL have this format:
"http://example.com/fileexample.pdf?id=****"
I'm using these parameters:
wget -r -l1 -A.pdf http://example.com
wget is rejecting all the files when saving. I get this error when using --debug:
Removing file due to recursive rejection criteria in recursive_retrieve()
I think that's happening because of this "?id=****" after the extension.
But did you try -A "*.pdf*"? According to the wget docs, this should work.
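As a hedged sketch against the example.com URL from the question, the command would become:
# the trailing * lets the accept pattern match saved names such as fileexample.pdf?id=1234
wget -r -l1 -A "*.pdf*" http://example.com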

How to parse html with wget to download an artifact using pattern matching against Jenkins

I am trying to download an artifact from Jenkins where I need the latest build. If I curl jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build it brings me to the page that contains the artifact I need to download, which in my case is myCompany-1234.ipa
So, changing curl to wget with --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ downloads the index.html file.
If I add --reject index.html, it stops index.html from being downloaded.
If I add the name of the artifact with a wildcard, like MyCompany-*.ipa, it downloads a 14k file named MyCompany-*.ipa, not the MyCompany-1234.ipa I was hoping for. Keep in mind the page I am requesting only has one MyCompany-1234.ipa, so there will never be multiple matches found.
If I use a flag to pattern match, -A "*.ipa", like so: wget --auth-no-challenge -A "*.ipa" https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ it still doesn't download the artifact.
It works if I enter the exact URL, like so: wget --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/MyCompany-1234.ipa
The problem here is the .ipa is not always going to be 1234; tomorrow it will be 1235, and so on. How can I either parse the HTML or use the wildcard correctly in wget to ensure I am always getting the latest?
Never mind; working with another engineer here at work, we came up with a super elegant solution parsing JSON.
Install Chrome and get the JSONView plugin.
Call the Jenkins API in your Chrome browser using https://$domain/$job/lastSuccessfulBuild/api/json
This will print out the key/value pairs in JSON. Note your key; for me it was number.
brew install jq
In a bash script, create a variable that will store the dynamic value as follows.
This stores the build number in latest:
latest=$(curl --silent --show-error https://userIsMe:123myapikey321#jenkins.mycompany.com/job/build_ios/lastSuccessfulBuild/api/json | jq '.number')
Print it to screen if you would like:
echo $latest
Now, with some string interpolation, pass the latest variable to your wget call:
wget --auth-no-challenge https://userIsMe:123myapikey321#jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa
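Putting those steps together, a minimal sketch of the whole thing (written with the usual user:token@host form where the question shows #; the host, job name and API key are the question's placeholders):
#!/bin/bash
# grab the latest successful build number from the Jenkins JSON API
latest=$(curl --silent --show-error \
  "https://userIsMe:123myapikey321@jenkins.mycompany.com/job/build_ios/lastSuccessfulBuild/api/json" | jq '.number')
echo "Latest successful build: $latest"
# fetch the matching artifact
wget --auth-no-challenge \
  "https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-${latest}.ipa"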
Hope this helps someone out as there is limited info out there that is clear and concise, especially given that wget has been around for an eternity.

Wget span host only for images/stylesheets/javascript but not links

Wget has the -H "span host" option
Span to any host—‘-H’
The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied depth, these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.
I want to do a recursive download (say, of level 3), and I want to get images, stylesheets, JavaScript, etc. (that is, the files necessary to display the page properly) even if they're outside my host. However, I don't want to follow a link to another HTML page (because then it can go to yet another HTML page, and so on, and the number of pages can explode).
Is it possible to do this somehow? It seems like the -H option controls spanning to other hosts for both the images/stylesheets/javascript case and the link case, and wget doesn't allow me to separate the two.
Downloading All Dependencies in a page
The first step is downloading all the resources of a particular page. If you look in the man page for wget you will find this:
...to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://<site>/<document>
Getting Multiple Pages
Unfortunately, that only works per-page. You can turn on recursion with -r, but then you run into the issue of following external sites and blowing up. If you know the full list of domains that could be used for resources, you can limit it to just those using -D, but that might be hard to do. I recommend using a combination of -np (no parent directories) and -l to limit the depth of the recursion. You might start getting other sites, but it will at least be limited. If you start having issues, you could use --exclude-domains to limit the known problem causers. In the end, I think this is best:
wget -E -H -k -K -p -np -l 1 http://<site>/level
Limiting the domains
To help figure out what domains need to be included/excluded you could use this answer to grep a page or two (you would want to grep the .orig file) and list the links within them. From there you might be able to build a decent list of domains that should be included and limit it using the -D argument. Or you might at least find some domains that you don't want included and limit them using --exclude-domains. Finally, you can use the -Q argument to limit the amount of data downloaded as a safeguard to prevent filling up your disk.
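As a hedged sketch of those limits (the domain names and the quota here are placeholders, not values from the question):
wget -E -H -k -K -p -np -l 1 \
     -D example.com,cdn.example.com \
     --exclude-domains ads.example.com \
     -Q 500m \
     http://example.com/page.html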
Descriptions of the Arguments
-E
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
-H
Enable spanning across hosts when doing recursive retrieving.
-k
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-K
When converting a file, back up the original version with a .orig suffix.
-p
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-np
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
-l
Specify recursion maximum depth level depth.
-D
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
--exclude-domains
Specify the domains that are not to be followed.
-Q
Specify download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with k suffix), or megabytes (with m suffix).
Just use wget -E -H -k -K -p -r http://<site>/ to download a complete site. Don't get nervous if, while the download is running, you open some page and its resources are not available; when wget finishes it all, it will convert the links!
For downloading all "files necessary to display the page properly" you can use -p or --page-requisites, perhaps together with -Q or --quota.
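A minimal sketch of that suggestion (the URL and the 100m quota are placeholders; -H is added so requisites hosted elsewhere are fetched too):
wget -p -H -Q 100m http://example.com/page.html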
Try using the wget --accept-regex flag; the POSIX --regex-type is compiled into wget as standard, but you can compile in the Perl regex engine (PCRE) if you need something more elaborate.
E.g. the following will get all PNGs on external sites one level deep, plus any other pages that have the word google in the URL:
wget -r -H -k -l 1 --regex-type posix --accept-regex "(.*google.*|.*png)" "http://www.google.com"
It doesn't actually solve the problem of going down multiple levels into external sites; for that you would probably have to write your own spider. But using --accept-regex you can probably get close to what you are looking for in most cases.
Within a single layer of a domain you can check all links internally, and on third-party servers, with the following command:
wget --spider -nd -e robots=off -Hprb --level=1 -o wget-log -nv http://localhost
The limitation here is that it only checks a single layer. This works well with a CMS where you can flatten the site with a GET variable rather than CMS-generated URLs. Otherwise you can use your favorite server-side script to loop this command through directories. For a full explanation of all of the options, check out this GitHub commit:
https://github.com/jonathan-smalls-cc/git-hooks/blob/LAMP/contrib/pre-commit/crawlDomain.sh

Getting %0D%0D at the end of the URL when accessing from Unix

I am accessing a file kept in my SVN repository from Unix with the wget command.
#!/bin/bash
ANTBUILDFILE=http://l09089r4.tst.poles.com:1808/svn/CommonMDM/trunk/Common/BuildArtifacts/VendorCatalog_Weblogic/build_CustomUI.xml
cd /tmp/install
wget -nc ${ANTBUILDFILE}
But I am getting this output:
--2013-05-16 00:21:51-- http://l09089r4.tst.poles.com:1808/svn/CommonMDM/trunk/Common/BuildArtifacts/VendorCatalog_Weblogic/build_CustomUI.xml%0D%0D
wget: /home/tkmd999/.netrc:3: unknown token "ibm"
Resolving l09089r4.tst.poles.com... 10.8.91.58
Connecting to l09089r4.tst.poles.com|10.8.91.58|:18080... connected.
HTTP request sent, awaiting response... 404 Not Found
2013-05-16 00:21:51 ERROR 404: Not Found.
There are %0D%0D at the end of the URL, which makes it inaccessible.
After getting this error I converted the file referenced in the URL to Unix format too, and committed my changes to the SVN repository, but I am still getting this error.
Any other ideas I can follow to get rid of this error?
Thanks,
Manish
The %0D you see are most likely remnants of Windows-style CRLF newlines in your shell script: one from the ANTBUILDFILE=... line and one from the wget ... line.
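A hedged way to confirm and fix this (the script name here is a placeholder for your build script):
file myscript.sh                              # reports "with CRLF line terminators" if CRs are present
tr -d '\r' < myscript.sh > myscript.fixed.sh  # strip the carriage returns
mv myscript.fixed.sh myscript.sh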
There can be a number of more or less subtle reasons for this, for example the svn:eol-style property:
TortoiseSVN sets svn:eol-style to native by default, trying to follow the convention of the client OS.
This can lead to confusion when using network shares accessible by several operating systems or tools that have different expectations on newlines.
If this turns out to be the situation you are experiencing, you can simply remove the svn:eol-style property from the file and commit it with the newline style you want.
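If that is the case, a sketch of the cleanup (run inside the working copy; the path is taken from the URL in the question) might look like:
svn propdel svn:eol-style Common/BuildArtifacts/VendorCatalog_Weblogic/build_CustomUI.xml
svn commit -m "Drop svn:eol-style so the build script keeps Unix line endings" \
    Common/BuildArtifacts/VendorCatalog_Weblogic/build_CustomUI.xml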
