Cannot get 'wget --recursive' to work over HTTPS

I would like to download this page:
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset
as well as its subpages, especially the .pdf documents:
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_thursday_30_oct.pdf
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_hints_for_w45.pdf
etc.
When I give this command:
$ wget --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi -k https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset
I get:
$ ls -R
.:
noppa.aalto.fi
./noppa.aalto.fi:
noppa robots.txt
./noppa.aalto.fi/noppa:
kurssi
./noppa.aalto.fi/noppa/kurssi:
ms-a0210
./noppa.aalto.fi/noppa/kurssi/ms-a0210:
viikkoharjoitukset.html
I have tried several wget options, with no luck.
What could be the problem?

By default, wget adheres to robots.txt files, which, in this case, disallows all access:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Disallow: /cgi-bin/
If you add -e robots=off to your command line, wget will ignore the robots.txt file.
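So, as a sketch, your original command with just that one option added (nothing else changed) should get past the robots.txt restriction:
$ wget --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi -e robots=off -k https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset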

Related

Is it possible for wget to flatten the result directories?

Is there any way to make wget output everything in a single flat folder? Right now I'm running
wget --quiet -E -H -k -K -p -e robots=off #{url}
but I'm getting everything in the same nested layout as it is on the site. Is there any option to flatten the resulting folder structure (and also the source links in the index.html file)?
After reading the documentation and some examples, I found that I was missing the -nd flag, which makes wget download just the files without recreating the directories.
Correct call: wget --quiet -E -H -k -nd -K -p -e robots=off #{url}
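As a rough follow-up sketch, -nd can also be combined with -P so the files still land flat, but in a directory of your choosing (keeping the #{url} placeholder from the question):
# ./flat is just an example target directory; #{url} is the placeholder from the question
wget --quiet -E -H -k -nd -K -p -e robots=off -P ./flat #{url}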

Wget not downloading the URL given to it

My wget request is
wget --reject jpg,png,css,js,svg,gif --convert-links -e robots=off --content-disposition --timestamping --recursive --domains appbase.io --no-parent --output-file=logfile --limit-rate=200k -w 3 --random-wait docs.appbase.io
On the docs.appbase.io page, there are two different types of a href links:
v2.0
v3.0
The first link (v2.0) is downloaded recursively, but the v3.0 one is not.
What should I do to recursively download that URL as well?

WGET ignoring --content-disposition?

I am trying to run a command to download 3000 files in parallel. I am using Cygwin + Windows.
Downloading a single file via wget in the terminal:
wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password#website.webpage.com/folder/document/download/1?type=file
allows me to download the file with ID 1 on its own, in the correct format (as long as --content-disposition is in the command).
I iterate over this REST API call to download the entire folder (3000 files). This works OK, but is quite slow.
FOR /L %i in (0,1,3000) do wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password#website.webpage.com/folder/document/download/%i?type=file
Now I am trying to run the program in Cygwin, in parallel.
seq 3000 | parallel -j 200 wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password#website.webpage.com/folder/document/download/{}?type=file
It runs, but the file name and format are lost (instead of "index.html", for example, we may get "4#type=file" as the file name).
Is there a way for me to fix this?
It is unclear what you would like them named. Let us assume you want them named: index.[1-3000].html
seq 3000 | parallel -j 200 wget -O index.{}.html --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password#website.webpage.com/folder/document/download/{}?type=file
My guess is that it is caused by --content-disposition being experimental, and that the wget used by Cygwin may be older than the wget used by the FOR loop. To check, run:
wget --version
in Cygwin and outside Cygwin (i.e. where you would run the FOR loop).

How to download multiple URLs using wget in a single command?

I am using the following command to download a single web page with all its images and JS using wget on Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It downloads the HTML as required, but when I tried to pass it a text file with a list of 3 URLs to download, it didn't give any output. Below is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this also:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
The URLs in this text file already had http:// prepended.
list.txt contains a list of 3 URLs which I need to download using a single command. Please help me resolve this issue.
From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or using the links from comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected.
First create a text file with the URLs that you need to download.
e.g. download.txt
download.txt will look like this:
http://www.google.com
http://www.yahoo.com
Then use the command wget -i download.txt to download the files. You can add as many URLs as you like to the text file.
If you have a list of URLs separated on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
Pedantic version:
for x in {'url1','url2'}; do wget $x; done
The advantage is that you can treat it as a single wget command for multiple URLs.
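For the record, a minimal variant of the same loop without the brace expansion, with the variable quoted:
# example.com URLs are placeholders
for url in http://example.com/a http://example.com/b; do wget "$url"; done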

wget download aspx page

I want to download the web page http://www.codeproject.com/KB/tips/ModelViewController.aspx using wget, so I simply used the really basic command:
wget http://www.codeproject.com/KB/tips/ModelViewController.aspx
What I received was a file with the .aspx extension, which could not be displayed correctly in a regular browser.
How can I download that web page?
Courtesy of the wget manual page (first result of a web search on "wget options", btw):
wget -E http://whatever.url.example.com/x/y/z/foo.aspx
If you also wish to download all related media (CSS, images, etc.), use -p, possibly with --convert-links (rewrites the page for offline viewing):
wget -Ep --convert-links http://whatever.url.example.com/x/y/z/foo.aspx
The file will actually display correctly; you can rename it to a .html file and confirm this. The server-side technology used by the web server doesn't affect wget.
Edit: my comments below this were wrong; thanks to the commenter for pointing it out. I have removed them for future readers.
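A minimal illustration of that rename suggestion, using the URL from the question (the saved file name is simply wget's default, taken from the last part of the URL):
$ wget http://www.codeproject.com/KB/tips/ModelViewController.aspx
$ mv ModelViewController.aspx ModelViewController.html   # same content, now with a .html extension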
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.org \
--no-parent \
www.example.org/tutorials/html/
From this page: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
