How to prevent certain links from being downloaded with wget

I am trying to download some pages of the site http://computerone.altervista.org, just for testing…
My goal is to download only the pages matching the patterns "*JavaScript*" and "*index*".
If I try the following options
wget \
-A "*Javascript*, *index*" \
--exclude-domains http://computerone.altervista.org/rss-articles/ \
-e robots=off \
--mirror -E -k -p -np -nc --convert-links \
--wait=5 -c \
http://computerone.altervista.org
it works, except that it also tries to download http://computerone.altervista.org/rss-articles/.
My questions are:
Why does it try to download the http://computerone.altervista.org/rss-articles/ page?
How can I avoid this? I tried the --exclude-domains http://computerone.altervista.org/rss-articles/ option, but it still downloads it.
P.S.:
Looking at the page source, I see:
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="rss-articles/" />

wget -p downloads all page requisites.
From man wget:
To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.
To exclude rss-articles, use -X (--exclude-directories):
wget -A "*Javascript*,*index*" -X "rss-articles" -e robots=off --mirror -E -k -p -np -nc -c http://computerone.altervista.org
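If you ever need to skip more than one directory, note that -X takes a comma-separated list and, per the manual, its entries may contain wildcards. A sketch (the second directory here is hypothetical):
wget -A "*Javascript*,*index*" -X "rss-articles,feed*" -e robots=off --mirror -E -k -p -np -nc -c http://computerone.altervista.org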

Related

Is it possible for Wget to flatten the result directories?

Is there any way to make wget output everything in a single flat folder? Right now I'm running:
wget --quiet -E -H -k -K -p -e robots=off #{url}
but I'm getting everything in the same nested layout as on the site. Is there any option to flatten the resulting folder structure (and also the source links in the index.html file)?
After reading the documentation and some examples, I found that I was missing the -nd flag, which makes wget save just the files without recreating the directories.
Correct call: wget --quiet -E -H -k -nd -K -p -e robots=off #{url}
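To make the -nd effect concrete, a sketch with hypothetical paths: without -nd a page requisite might be saved as example.com/assets/css/style.css, while with -nd the same file lands flat as style.css in the output directory; if two files share a name, wget appends numeric suffixes (style.css.1, style.css.2, …) instead of overwriting.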

wget: do not download subdirectories, only all files in the specified directory [duplicate]

I am trying to download the files for a project using wget, as the SVN server for that project isn't running anymore and I am only able to access the files through a browser. The base URL for all the files is the same:
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/*
How can I use wget (or any other similar tool) to download all the files in this repository, where the "tzivi" folder is the root folder and there are several files and sub-folders (up to 2 or 3 levels) under it?
You may use this in shell:
wget -r --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The parameters are:
-r: recursive download
--no-parent: don't download anything from the parent directory
If you don't want to download the entire content, you may use:
-l1: just download the directory (tzivi in your case)
-l2: download the directory and all level-1 subfolders ('tzivi/something' but not 'tzivi/something/foo')
And so on. If you give no -l option, wget uses -l 5 automatically.
With -l 0 there is no depth limit at all, so wget will follow every link it can reach (by default still only on the same host).
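Putting the depth option together with the URL above, a sketch that fetches tzivi/ plus its first level of subfolders (per the explanation of -l2) would be:
wget -r -l 2 --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/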
You can use this in a shell:
wget -r -nH --cut-dirs=7 --reject="index.html*" \
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The parameters are:
-r: download recursively
-nH (--no-host-directories): don't create a directory named after the host
--cut-dirs=X: skip the first X remote directory components when saving
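To illustrate --cut-dirs=7 here: the remote path has seven components (projects/tzivi/repository/revisions/2/raw/tzivi), so a hypothetical file fetched from .../raw/tzivi/sub/file.txt is saved locally as sub/file.txt rather than abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/sub/file.txt.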
This link just gave me the best answer:
$ wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off -U mozilla http://base.site/dir/
Worked like a charm.
wget -r --no-parent URL --user=username --password=password
The last two options are only needed if the download requires a username and password; otherwise leave them out.
You can see more options at https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
Use the command
wget -m www.ilanni.com/nexus/content/
You can also use this command:
wget --mirror -pc --convert-links -P ./your-local-dir/ http://www.your-website.com
so that you get an exact mirror of the website you want to download.
Try this working command (tested 30-08-2021); the leading ! suggests a notebook cell and should be dropped in a regular shell:
!wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off --adjust-extension -U mozilla "your web directory, in quotes"
I can't get this to work.
Whatever I try, I just get some HTML file.
Is this really what it takes just to download a directory?
There must be a better way; wget seems like the wrong tool for this task, unless I'm missing something.
This works:
wget -m -np -c --no-check-certificate -R "index.html*" "https://the-eye.eu/public/AudioBooks/Edgar%20Allan%20Poe%20-%2"
This will help:
wget -m -np -c --level 0 --no-check-certificate -R "index.html*" http://www.your-websitepage.com/dir

Wget not downloading the URL given to it

My wget request is
wget --reject jpg,png,css,js,svg,gif --convert-links -e robots=off --content-disposition --timestamping --recursive --domains appbase.io --no-parent --output-file=logfile --limit-rate=200k -w 3 --random-wait docs.appbase.io
On the docs.appbase.io page, there are two different types of <a href> links:
v2.0
v3.0
The first link (v2.0) is recursively downloaded, but the second (v3.0) is not.
What should I do to recursively download the v3.0 URL as well?

How to download multiple URLs with wget in a single command?

I am using the following command to download a single webpage with all its images and JS using wget on Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It downloads the HTML as required, but when I tried to pass a text file with a list of 3 URLs to download, it didn't give any output. Below is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this also:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
This text file had the URLs with http:// already prepended.
list.txt contains the list of 3 URLs which I need to download using a single command. Please help me resolve this issue.
From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or using the links from comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected.
First create a text file with the URLs that you need to download,
e.g. download.txt.
download.txt will look like this:
http://www.google.com
http://www.yahoo.com
Then use the command wget -i download.txt to download the files. You can add as many URLs as you like to the text file.
If you have a list of URLs on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
Pedantic version:
for x in 'url1' 'url2'; do wget "$x"; done
The advantage is that you can treat it as if it were a single wget command for each URL.

wget download aspx page

I want to download the web page http://www.codeproject.com/KB/tips/ModelViewController.aspx using wget, so I simply used the really basic command:
wget http://www.codeproject.com/KB/tips/ModelViewController.aspx
What I received was a file with the .aspx extension, which could not be displayed correctly in a regular browser.
How can I download that web page?
Courtesy of the wget manual page (first result of a web search on "wget options", btw):
wget -E http://whatever.url.example.com/x/y/z/foo.aspx
If you also wish to download all related media (CSS, images, etc.), use -p, possibly with --convert-links (rewrites the page for offline viewing):
wget -Ep --convert-links http://whatever.url.example.com/x/y/z/foo.aspx
The file will actually display correctly; you can rename it to a .html file to confirm this. The server-side technology used by the web server doesn't affect wget.
Edit: my comments below this were wrong; thanks to the commenter for pointing it out. I have removed them for future readers.
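For instance, using the file name from the question, you can either rename the download afterwards or let -E/--adjust-extension do it for you as in the earlier answer:
mv ModelViewController.aspx ModelViewController.html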
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.org \
--no-parent \
www.example.org/tutorials/html/
From this page: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
