Is there any way to make wget output everything in a single flat folder? Right now I'm running:
wget --quiet -E -H -k -K -p -e robots=off #{url}
but I'm getting everything nested the same way it is on the site. Is there an option to flatten the resulting folder structure (and fix the source links in the index.html file accordingly)?
After reading the documentation and some examples, I found that I was missing the -nd flag, which makes wget save just the files without recreating the site's directory structure.
Correct call:
wget --quiet -E -H -k -nd -K -p -e robots=off #{url}
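For reference, here is the same call with an explicit destination directory added (the URL and folder are placeholders). Note that with -nd, files that would share a name are saved with numeric suffixes (file, file.1, file.2, ...):
# -nd flattens everything into one folder; -P picks that folder
wget --quiet -E -H -k -nd -K -p -e robots=off -P ./flat "https://example.com/"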
Related
I am trying to download the files for a project using wget, as the SVN server for that project isn't running anymore and I can only access the files through a browser. The base URL for all the files is the same:
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/*
How can I use wget (or any other similar tool) to download all the files in this repository, where the "tzivi" folder is the root folder and there are several files and sub-folders (up to 2 or 3 levels) under it?
You may use this in shell:
wget -r --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The parameters are:
-r            recursive download
--no-parent   don't download anything from the parent directory
If you don't want to download the entire content, you can limit the depth:
-l1   download just the directory itself (tzivi in your case)
-l2   download the directory and all level-1 subfolders ('tzivi/something' but not 'tzivi/something/foo')
And so on. If you pass no -l option, wget uses -l 5 by default.
With -l 0 you'll download the whole Internet, because wget will follow every link it finds.
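For example, to fetch tzivi plus only its immediate subfolders:
# depth 2: the tzivi directory and one level of subfolders
wget -r -l2 --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/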
You can use this in a shell:
wget -r -nH --cut-dirs=7 --reject="index.html*" \
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The parameters are:
-r                            recursive download
-nH (--no-host-directories)   don't create a top-level folder named after the host
--cut-dirs=X                  remove the first X directory components from the saved paths
--reject="index.html*"        skip the auto-generated directory listing pages
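Here X=7 matches the number of directory components in the URL path (projects/tzivi/repository/revisions/2/raw/tzivi), so the files land directly in the current directory. A quick bash sketch to count the components for any URL:
# count path components to pick a --cut-dirs value
url='http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/'
path=${url#*://*/}   # strip scheme and host
path=${path%/}       # drop the trailing slash
awk -F/ '{print NF}' <<< "$path"   # prints 7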
This gave me the best answer:
$ wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off -U mozilla http://base.site/dir/
Worked like a charm.
wget -r --no-parent URL --user=username --password=password
The last two options are only needed if the download requires a username and password; otherwise leave them off.
More options are described at https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
Use the command:
wget -m www.ilanni.com/nexus/content/
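-m (--mirror) turns on options suitable for mirroring; per the wget manual it is currently equivalent to:
wget -r -N -l inf --no-remove-listing www.ilanni.com/nexus/content/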
You can also use this command:
wget --mirror -p -c --convert-links -P ./your-local-dir/ http://www.your-website.com
This gives you an exact mirror of the website you want to download (-p fetches page requisites, -c resumes interrupted downloads).
Try this working command (tested 2021-08-30):
wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off -U mozilla "your web directory, in quotations"
I can't get this to work. Whatever I try, I just get some HTTP file.
Are these really the commands it takes simply to download a directory? There must be a better way; wget seems the wrong tool for this task, if not a complete failure.
This works (-m mirror, -np no parent, -c continue, -R rejects the generated index pages):
wget -m -np -c --no-check-certificate -R "index.html*" "https://the-eye.eu/public/AudioBooks/Edgar%20Allan%20Poe%20-%2"
This will help:
wget -m -np -c --level 0 --no-check-certificate -R "index.html*" http://www.your-websitepage.com/dir
I am trying to download all of the files in a directory using:
wget -r -N --no-parent -nH -P /media/karunakar --ftp-user=jsjd --ftp-password='hdshd' ftp://ftp.xyz.com/Suppliers/my/ORD20130908
but wget is fetching files from the parent directory, even though I specified --no-parent. I only want the files in ORD20130908.
You need to add a trailing slash to indicate that the last item in the URL is a directory, not a file:
wget -r -N --no-parent -nH -P /media/karunakar --ftp-user=jsjd --ftp-password='hdshd' ftp://ftp.xyz.com/Suppliers/my/ORD20130908
↓
wget -r -N --no-parent -nH -P /media/karunakar --ftp-user=jsjd --ftp-password='hdshd' ftp://ftp.xyz.com/Suppliers/my/ORD20130908/
From the documentation:
Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).
wget -r -N --no-parent -nH -P /media/karunakar --ftp-user=jsjd --ftp-password='hdshd' -I /Suppliers/my/ORD20130908 ftp://ftp.xyz.com/Suppliers/my
See the wget documentation for the use of the -I (--include-directories) flag; note that it expects absolute paths on the server, hence /Suppliers/my/ORD20130908 above.
I am using the following command to download a single webpage with all its images and JS, using wget on Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It downloads the HTML as required, but when I tried to pass it a text file holding a list of 3 URLs to download, it gave no output. This is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this as well:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
This time the text file had the http:// prefix included in each URL.
list.txt contains the list of 3 URLs I need to download using a single command. Please help me resolve this issue.
From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or, using the links from the comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected; you can drop the -B option, since it only matters when the input file contains relative URLs.
First create a text file with the URLs that you need to download, e.g. download.txt:
http://www.google.com
http://www.yahoo.com
Then run wget -i download.txt to download the files. You can add as many URLs as you like to the text file.
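A minimal end-to-end sketch of the same approach:
# write the URL list, then hand it to wget
cat > download.txt <<'EOF'
http://www.google.com
http://www.yahoo.com
EOF
wget -i download.txt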
If you have a list of URLs separated on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
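Equivalently, you can pipe the URLs in, since -i - reads the list from standard input:
printf '%s\n' http://example.com/a http://example.com/b http://example.com/c | wget -i -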
Pedantic version:
for x in 'url1' 'url2'; do wget "$x"; done
The advantage is that each URL is handled by its own single-URL wget invocation.
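If the URLs live in a file such as the list.txt above, xargs gives the same one-invocation-per-URL behavior:
# run wget once per URL read from list.txt
xargs -n 1 wget < list.txt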
I want to download this webpage using wget on Windows 7: http://www.att.com/shop/wireless/devices/smartphones.deviceListView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?commitmentTerm=24&taxoStyle=SMARTPHONES&showMoreListSize=1000
I am using this command:
wget -E -H -k -K -p -e robots=off -P /Downloads/AT&T_2013-01-29/ http://www.att.com/shop/wireless/devices/smartphones.deviceListView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?commitmentTerm=24&taxoStyle=SMARTPHONES&showMoreListSize=1000
I am getting errors saying that taxoStyle is not defined and commitmentTerm is not defined or not a recognized command.
Add quotes around the address:
wget -E -H -k -K -p -e robots=off -P "/Downloads/AT&T_2013-01-29/" "http://www.att.com/shop/wireless/devices/smartphones.deviceListView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?commitmentTerm=24&taxoStyle=SMARTPHONES&showMoreListSize=1000"
& is used as a command separator in the Windows command window, so without the quotes everything after the first & is treated as a separate command.
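If you'd rather not quote the URL, cmd.exe also lets you escape each & with ^; a sketch of the same call (the -P path keeps its quotes, since it contains an & as well):
wget -E -H -k -K -p -e robots=off -P "/Downloads/AT&T_2013-01-29/" http://www.att.com/shop/wireless/devices/smartphones.deviceListView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?commitmentTerm=24^&taxoStyle=SMARTPHONES^&showMoreListSize=1000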
I am trying to download some pages of the following site http://computerone.altervista.org, just for testing.
My goal is to download just the pages matching the following patterns "*JavaScript*" and "*index*".
If I try the following options:
wget \
-A "*Javascript*, *index*" \
--exclude-domains http://computerone.altervista.org/rss-articles/ \
-e robots=off \
--mirror -E -k -p -np -nc --convert-links \
--wait=5 -c \
http://computerone.altervista.org
it works, except that it also tries to download http://computerone.altervista.org/rss-articles/.
My questions are:
Why does it try to download the http://computerone.altervista.org/rss-articles/ page?
How should I avoid that? I tried the --exclude-domains http://computerone.altervista.org/rss-articles/ option, but it still tries to download it.
P.S.:
Looking at the page source I see:
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="rss-articles/" />
wget -p downloads all page requisites. From man wget:
To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.
Your <link rel="alternate" type="application/rss+xml" ...> is exactly such a tag, which is why wget follows it.
To exclude rss-articles, use -X or --exclude-directories (also drop the space after the comma in the -A list, or it becomes part of the second pattern):
wget -A "*Javascript*,*index*" -X "rss-articles" -e robots=off --mirror -E -k -p -np -nc -c http://computerone.altervista.org
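wget matches -X entries against absolute directory paths on the server, so if the bare name doesn't match, the explicit form below may be needed (an assumption worth verifying with your wget version):
# absolute-path form of -X (assumption: check against your wget's behavior)
wget -A "*Javascript*,*index*" -X /rss-articles -e robots=off --mirror -E -k -p -np -nc -c http://computerone.altervista.org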