using wget to download a directory - bash

I'm trying to download all the files in an online directory. The command I'm using is:
wget -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Using this command I get an empty directory. If I specify file names at the end I can get one at a time, but I'd like to get them all at once. Am I just missing something simple?
output from command:
--2015-03-14 14:54:05-- http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Resolving www.oecd-nea.org... 193.51.64.80
Connecting to www.oecd-nea.org|193.51.64.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'dbforms/data/eva/evatapes/mendl_2/index.html'
Saving to: 'robots.txt'

Add the depth of links you want to follow (-l1, since you only want to follow one link):
wget -e robots=off -l1 -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
I also added -e robots=off, since there is a robots.txt which would normally stop wget from going through that directory. For the rest of the world:
-r recursive,
-np no parent directory
-nH no host-prefixed directories (--no-host-directories)

Related

wget do not download subdirectories only all files in specified directory [duplicate]

I am trying to download the files for a project using wget, as the SVN server for that project isn't running anymore and I am only able to access the files through a browser. The base URL for all the files is the same, something like
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/*
How can I use wget (or any other similar tool) to download all the files in this repository, where the "tzivi" folder is the root folder and there are several files and sub-folders (up to 2 or 3 levels) under it?
You may use this in shell:
wget -r --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The parameters are:
-r // recursive download
--no-parent // don't download anything from the parent directory
If you don't want to download the entire content, you may use:
-l1 just download the directory (tzivi in your case)
-l2 download the directory and all level 1 subfolders ('tzivi/something' but not 'tzivi/something/foo')
And so on. If you insert no -l option, wget will use -l 5 automatically.
If you insert -l 0, you'll download the whole Internet, because wget will follow every link it finds.
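For example, to grab tzivi plus its first level of subfolders, a sketch using the same URL as above:
wget -r -l2 --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/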
You can use this in a shell:
wget -r -nH --cut-dirs=7 --reject="index.html*" \
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The Parameters are:
-r recursively download
-nH (--no-host-directories) cuts out hostname
--cut-dirs=X (cuts out X directories)
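To see what --cut-dirs actually removes, here is the illustration from the wget manual (using its ftp://ftp.xemacs.org/pub/xemacs/ example URL):
no options        -> saved as ftp.xemacs.org/pub/xemacs/
-nH               -> saved as pub/xemacs/
-nH --cut-dirs=1  -> saved as xemacs/
-nH --cut-dirs=2  -> saved as . (files land in the current directory)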
This link just gave me the best answer:
$ wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off -U mozilla http://base.site/dir/
Worked like a charm.
wget -r --no-parent URL --user=username --password=password
The last two options are only needed if the download requires a username and password; otherwise there is no need to use them.
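If you'd rather not leave the password in your shell history, wget can also prompt for it; a sketch of the same command with --ask-password in place of --password:
wget -r --no-parent --user=username --ask-password URL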
You can also see more options in the link https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
Use the command:
wget -m www.ilanni.com/nexus/content/
You can also use this command:
wget --mirror -pc --convert-links -P ./your-local-dir/ http://www.your-website.com
so that you get an exact mirror of the website you want to download.
Try this working command (30-08-2021):
wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off --adjust-extension -U mozilla "your web directory URL, in quotation marks"
I can't get this to work. Whatever I try, I just get some HTML file. Just looking at these commands for simply downloading a directory? There must be a better way; wget seems the wrong tool for this task, unless it is a complete failure.
This works:
wget -m -np -c --no-check-certificate -R "index.html*" "https://the-eye.eu/public/AudioBooks/Edgar%20Allan%20Poe%20-%2"
This will help:
wget -m -np -c --level 0 --no-check-certificate -R "index.html*" http://www.your-websitepage.com/dir

how do I download a large number of zip files with wget to a url

At the URL here there is a large number of zip files that I need to download and save to the test/files/downloads directory. I'm using wget with the command
wget -i http://bitly.com/nuvi-plz -P test/files/downloads
and it downloads the whole page into a file inside the directory, then starts downloading each zip file but gives me a 404 for each one that looks something like
--2016-05-12 17:12:28-- http://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://bitly.com/1462835080018.zip [following]
--2016-05-12 17:12:28-- https://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-05-12 17:12:29 ERROR 404: Not Found.
How can I get wget to download all the zip files on the page properly?
You need to get the redirect from bit.ly and then download all files. This is real ugly, but it worked:
wget http://bitly.com/nuvi-plz --server-response -O /dev/null 2>&1 | \
  awk '(NR==1){SRC=$3;} /^  Location: /{DEST=$2} END{ print SRC, DEST}' | sed 's|.*http|http|' | \
  while read url; do
    wget -A zip -r -l 1 -nd "$url" -P test/files/downloads
  done
If you use the direct link, this will work:
wget -A zip -r -l 1 -nd http://feed.omgili.com/5Rh5AMTrc4Pv/mainstream/posts/ -P test/files/downloads

How to download all images from a website using wget?

Here is an example of my command:
wget -r -l 0 -np -t 1 -A jpg,jpeg,gif,png -nd --connect-timeout=10 -P ~/support --load-cookies cookies.txt "http://support.proboards.com/" -e robots=off
Based on the input here
But nothing really gets downloaded, there is no recursive crawling, and it takes just a few seconds to complete. I am trying to back up all images from a forum; is the forum structure causing issues?
wget -r -P /download/location -A jpg,jpeg,gif,png http://www.site.here
works like a charm
Download a file with another name.
Here I provide wget.zip as the file name, as shown below.
# wget -O wget.zip http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
--2012-10-02 11:55:54-- http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
Resolving ftp.gnu.org... 208.118.235.20, 2001:4830:134:3::b
Connecting to ftp.gnu.org|208.118.235.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 446966 (436K) [application/x-gzip]
Saving to: wget.zip
100%[===================================================================================>] 446,966 60.0K/s in 7.5s
2012-10-02 11:56:02 (58.5 KB/s) - 'wget.zip' saved [446966/446966]

Download data from FTP Website

There is something which I am missing, or it might be the whole case. I am trying to download NCDC data from NCDC Datasets and am unable to do it from the Unix box.
The command I have used thus far is
wget ftp://ftp.ncdc.noaa.gov:21/pub/data/noaa/1901/029070-99999-1901.gz">029070-99999-1901.gz
This is for one file, but I will be very happy if I can download the entire parent directory.
You seem to have a lonely " just before the >
To download everything, you can try this command to get the whole directory content:
wget -r ftp://ftp.ncdc.noaa.gov:21/pub/data/noaa/1901/*
for i in {1990..1993}
do
    echo "$i"
    cd /home/chile/data
    # -nH Disable generation of host-prefixed directories
    # -nd all files will get saved to the current directory
    # -np Do not ever ascend to the parent directory when retrieving recursively
    # -R "*.html,999999-99999-$i.gz*" don't download files matching these patterns
    wget -r -nH -nd -np -R "*.html,999999-99999-$i.gz*" http://www1.ncdc.noaa.gov/pub/data/noaa/$i/
done

Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:
http://mysite.com/configs/.vim/
.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?
You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:
wget --recursive --no-parent http://example.com/configs/.vim/
To avoid downloading the auto-generated index.html files, use the -R/--reject option:
wget -r -np -R "index.html*" http://example.com/configs/.vim/
To download a directory recursively, rejecting index.html* files and downloading without the hostname, parent directory, and the rest of the directory structure:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:
wget -e robots=off http://www.example.com/
http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
You should use the -m (mirror) flag, as that takes care to not mess with timestamps and to recurse indefinitely.
wget -m http://example.com/configs/.vim/
If you add the points mentioned by others in this thread, it would be:
wget -m -e robots=off --no-parent http://example.com/configs/.vim/
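For reference, the wget manual describes -m/--mirror as shorthand for -r -N -l inf --no-remove-listing, so the plain wget -m command above is equivalent to:
wget -r -N -l inf --no-remove-listing http://example.com/configs/.vim/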
Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):
wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/
If --no-parent does not help, you might use the --include option.
Directory struct:
http://<host>/downloads/good
http://<host>/downloads/bad
And you want to download downloads/good but not the downloads/bad directory:
wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good
wget -r http://mysite.com/configs/.vim/
works for me.
Perhaps you have a .wgetrc which is interfering with it?
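If you suspect that, a quick sketch for checking (the --no-config flag is available in newer wget releases):
cat ~/.wgetrc                                         # see whether any options are set there
wget --no-config -r http://mysite.com/configs/.vim/   # re-run ignoring all init files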
First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:
wget --recursive ${comment# self-explanatory} \
--no-parent ${comment# will not crawl links in folders above the base of the URL} \
--convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
--random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
--no-host-directories ${comment# do not create folders with the domain name} \
--execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
--level=inf --accept '*' ${comment# do not limit to 5 levels or common file formats} \
--reject="index.html*" ${comment# use this option if you need an exact mirror} \
--cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL
Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.
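If you need that query-string cleanup, a minimal sketch (assumes bash, and that no two saved files differ only in their query string):
# rename files like main.css?crc=12324567 to main.css
find . -type f -name '*\?*' | while read -r f; do
    mv -- "$f" "${f%%\?*}"
done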
Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.
To fetch a directory recursively with username and password, use the following command:
wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/
This version downloads recursively and doesn't create parent directories.
wgetod() {
    # count the path components in the URL, then cut all but the last one
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}
Usage (add to ~/.bashrc or paste into your terminal):
wgetod "http://example.com/x/"
The following options seem to be the perfect combination when dealing with recursive download:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
All you need is two flags: "-r" for recursion and "--no-parent" (or -np) so it does not follow the '.' and '..' links. Like this:
wget -r --no-parent http://example.com/configs/.vim/
That's it. It will download into the following local tree: ./example.com/configs/.vim .
However if you do not want the first two directories, then use the additional flag --cut-dirs=2 as suggested in earlier replies:
wget -r --no-parent --cut-dirs=2 http://example.com/configs/.vim/
And it will download your file tree only into ./.vim/
In fact, I got the first line of this answer straight from the wget manual; it has a very clean example towards the end of section 4.3.
It sounds like you're trying to get a mirror of your file. While wget has some interesting FTP and SFTP uses, a simple mirror should work. Just a few considerations to make sure you're able to download the file properly.
Respect robots.txt
Ensure that if you have a /robots.txt file in your public_html, www, or configs directory, it does not prevent crawling. If it does, instruct wget to ignore it by adding the following option to your wget command:
wget -e robots=off 'http://your-site.com/configs/.vim/'
Convert remote links to local files.
Additionally, wget must be instructed to convert links so that they point to your downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is to use the mirror command.
Try this:
wget -mpEk 'http://your-site.com/configs/.vim/'
# If robots.txt is present:
wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'
# Good practice: only deal with the highest-level directory you specify (instead of downloading all of `mysite.com`, you're just mirroring from `.vim`)
wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'
Using -m instead of -r is preferred, as it doesn't have a maximum recursion depth and it downloads all assets. Mirror is pretty good at determining the full depth of a site; however, if you have many external links you could end up downloading more than just your site, which is why we use -p -E -k. The output should be all the prerequisite files needed to render the page, with the directory structure preserved. -k converts links to local files.
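Spelled out in long options, that -mpEk command is:
wget --mirror --page-requisites --adjust-extension --convert-links 'http://your-site.com/configs/.vim/'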
Since you should have a link set up, you should get your configs folder with the files from /.vim.
Mirror mode also works with a directory structure that's set up over ftp://.
General rule of thumb:
Depending on the size of the site you are mirroring, you're sending many calls to the server. In order to prevent yourself from being blacklisted or cut off, use the wait option to rate-limit your downloads.
wget -mpEk --no-parent -e robots=off --random-wait 'http://your-site.com/configs/.vim/'
But if you're simply downloading the ../config/.vim/ file, you shouldn't have to worry about it, as you're ignoring parent directories and downloading a single file.
Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...
wget --recursive (...)
...only retrieves index.html instead of all files.
Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.
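One way to spot such a redirect without downloading anything (the URL here is just a placeholder):
wget --spider --server-response http://example.com/configs/.vim/ 2>&1 | grep -iE 'HTTP/|Location:'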
Recursive wget ignoring robots (for websites)
wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'
-e robots=off causes it to ignore robots.txt for that domain
-r makes it recursive
-np = no parents, so it doesn't follow links up to the parent folder
You should be able to do it simply by adding a -r
wget -r http://stackoverflow.com/
