how do I download a large number of zip files from a url with wget - bash

At the url here there is a large number of zip files that I need to download and save to the test/files/downloads directory. I'm using wget with the command
wget -i http://bitly.com/nuvi-plz -P test/files/downloads
It downloads the whole page into a file inside the directory and starts downloading each zip file, but then gives me a 404 for each file that looks something like:
--2016-05-12 17:12:28-- http://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://bitly.com/1462835080018.zip [following]
--2016-05-12 17:12:28-- https://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-05-12 17:12:29 ERROR 404: Not Found.
How can I get wget to download all the zip files on the page properly?

You need to resolve the redirect from bit.ly and then download all the files from the page it points to. This is really ugly, but it worked:
# Follow the bit.ly redirect (via the response headers) to find the real page,
# then recursively grab every .zip linked from it.
wget http://bitly.com/nuvi-plz --server-response -O /dev/null 2>&1 | \
  awk '(NR==1){SRC=$3;} /^[[:space:]]*Location: /{DEST=$2} END{ print SRC, DEST}' | sed 's|.*http|http|' | \
  while read -r url; do
    wget -A zip -r -l 1 -nd "$url" -P test/files/downloads
  done
If you use the direct link, this will work:
wget -A zip -r -l 1 -nd http://feed.omgili.com/5Rh5AMTrc4Pv/mainstream/posts/ -P test/files/downloads
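A shorter alternative sketch, if curl is available, is to let curl resolve the redirect chain and hand the final URL to wget (-L follows redirects, -o /dev/null discards the body, and -w '%{url_effective}' prints the URL curl ended up at):
url=$(curl -Ls -o /dev/null -w '%{url_effective}' http://bitly.com/nuvi-plz)
wget -A zip -r -l 1 -nd "$url" -P test/files/downloads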

Related

Unix Wget | # in URL | Syntax Issue

What should the wget command be if the URL contains a #?
Example of my URL:
https://tableau.abc.intranet/#/site/QQ/views/Myreport/DailyReport.csv
Command:
wget -P /temp "https://tableau.abc.intranet/#/site/QQ/views/Myreport/DailyReport.csv"
However, wget only considers the URL up to intranet/; it does not take anything after that.
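Everything after the # is a URL fragment, which clients such as wget never send to the server, so the part past intranet/ is dropped. A minimal sketch of a workaround, assuming the server actually expects a literal # in the path (with a Tableau-style client-side route this may still not return the CSV), is to percent-encode it as %23:
wget -P /temp "https://tableau.abc.intranet/%23/site/QQ/views/Myreport/DailyReport.csv"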

Wget not downloading the URL given to it

My wget request is
wget --reject jpg,png,css,js,svg,gif --convert-links -e robots=off --content-disposition --timestamping --recursive --domains appbase.io --no-parent --output-file=logfile --limit-rate=200k -w 3 --random-wait docs.appbase.io
On the docs.appbase.io page, there are two different types of a href links:
v2.0
v3.0
The first link (v2.0) is recursively downloaded, but the v3.0 one is not.
What should I do to recursively download the full URL as well?
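A guess rather than a verified fix: by default wget does not span hosts during a recursive download, so if the v3.0 link points at a host other than docs.appbase.io it is skipped even though --domains appbase.io is set. Allowing host spanning while keeping the domain restriction might help:
wget --span-hosts --domains=appbase.io --reject jpg,png,css,js,svg,gif --convert-links -e robots=off --content-disposition --timestamping --recursive --no-parent --output-file=logfile --limit-rate=200k -w 3 --random-wait docs.appbase.io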

using wget to download a directory

I'm trying to download all the files in an online directory. The command I'm using is:
wget -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Using this command I get an empty directory. If I specify file names at the end I can get one at a time, but I'd like to get them all at once. Am I just missing something simple?
Output from the command:
--2015-03-14 14:54:05-- http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Resolving www.oecd-nea.org... 193.51.64.80
Connecting to www.oecd-nea.org|193.51.64.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'dbforms/data/eva/evatapes/mendl_2/index.html'
Add the depth of links you want to follow (-l1, since you only want to go one level down):
wget -e robots=off -l1 -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
I also added -e robots=off, since there is a robots.txt which would normally stop wget from going through that directory. For the rest of the world:
-r recursive,
-np don't ascend to the parent directory,
-nH no host-prefixed directories (don't create a local www.oecd-nea.org/ folder)
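If you also don't want wget to recreate the deep dbforms/data/eva/evatapes path locally, the --cut-dirs option drops that many leading directory components; a sketch, assuming you only want a local mendl_2/ folder:
wget -e robots=off -l1 -r -np -nH --cut-dirs=4 -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/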

How to download all images from a website using wget?

Here is an example of my command:
wget -r -l 0 -np -t 1 -A jpg,jpeg,gif,png -nd --connect-timeout=10 -P ~/support --load-cookies cookies.txt "http://support.proboards.com/" -e robots=off
Based on the input here
But nothing really gets downloaded, there is no recursive crawling, and it takes just a few seconds to complete. I am trying to back up all the images from a forum. Is the forum structure causing issues?
wget -r -P /download/location -A jpg,jpeg,gif,png http://www.site.here
works like a charm
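One caveat (an assumption about this forum, not something shown above): if the images are served from a different host than the pages, for example an image CDN, wget will not follow them unless host spanning is enabled. Something like the following, where images.example.com is a placeholder for wherever the images actually live:
wget -r -l 0 -H -D proboards.com,images.example.com -A jpg,jpeg,gif,png -nd -P ~/support -e robots=off "http://support.proboards.com/"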
Download a file under another name.
Here I provide the wget.zip file name with -O, as shown below:
# wget -O wget.zip http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
--2012-10-02 11:55:54-- http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
Resolving ftp.gnu.org... 208.118.235.20, 2001:4830:134:3::b
Connecting to ftp.gnu.org|208.118.235.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 446966 (436K) [application/x-gzip]
Saving to: wget.zip
100%[===================================================================================>] 446,966 60.0K/s in 7.5s
2012-10-02 11:56:02 (58.5 KB/s) - wget.zip

List of new files downloaded with wget

I'm using wget to download new files from an FTP server. The new files that have been downloaded need to be processed by another script.
wget -N -r ftp://server/folder
So my question is: how do I get a list of all the files that wget has downloaded?
Thanks in advance!
You can use wget's -o flag to write the output to a logfile. The logfile will have the same format as the regular output to a terminal, for example:
--2012-06-28 17:57:13-- http://cdn.sstatic.net/stackoverflow/img/sprites.png
Resolving cdn.sstatic.net (cdn.sstatic.net)... 67.201.31.70
Connecting to cdn.sstatic.net (cdn.sstatic.net)|67.201.31.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16425 (16K) [image/png]
Saving to: `sprites.png'
0K .......... ...... 100% 131K=0.1s
2012-06-28 17:57:13 (131 KB/s) - `sprites.png' saved [16425/16425]
If you pipe this file through egrep as egrep -e "--" logfile.txt you will get only the lines specifying which files were downloaded, with timestamps.
--2012-06-28 17:57:13-- http://cdn.sstatic.net/stackoverflow/img/sprites.png
If you wish, you can then pipe it through cut, as in egrep -e "--" logfile.txt | cut -d ' ' -f 4, to get only the downloaded URLs.
http://cdn.sstatic.net/stackoverflow/img/sprites.png
You should add error-checks in here as well, but this is the basic outline.
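Putting the pieces together, a rough sketch of the whole flow (process.sh is a hypothetical stand-in for the script that handles new files, and the cut field assumes HTTP-style log lines like the one above; FTP log lines may need slightly different filtering):
wget -N -r -o wget.log ftp://server/folder
egrep -e "--" wget.log | cut -d ' ' -f 4 | while read -r url; do
    ./process.sh "$url"
done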
