List of new files downloaded with wget - bash

I'm using wget to download new files from an FTP server. The newly downloaded files then need to be processed by another script.
wget -N -r ftp://server/folder
So my question is: how do I get a list of all the files that wget has downloaded?
Thanks in advance!

You can use wget's -o flag to send its log output to a file. The logfile has the same format as the regular terminal output, for example:
--2012-06-28 17:57:13-- http://cdn.sstatic.net/stackoverflow/img/sprites.png
Resolving cdn.sstatic.net (cdn.sstatic.net)... 67.201.31.70
Connecting to cdn.sstatic.net (cdn.sstatic.net)|67.201.31.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16425 (16K) [image/png]
Saving to: `sprites.png'
0K .......... ...... 100% 131K=0.1s
2012-06-28 17:57:13 (131 KB/s) - `sprites.png' saved [16425/16425]
If you then run egrep -e "--" logfile.txt over that file, you will get only the lines recording which URLs were retrieved, with timestamps:
--2012-06-28 17:57:13-- http://cdn.sstatic.net/stackoverflow/img/sprites.png
If you wish, you can then pipe that through cut, as in egrep -e "--" logfile.txt | cut -d ' ' -f 4, to get only the downloaded URLs:
http://cdn.sstatic.net/stackoverflow/img/sprites.png
You should add error-checks in here as well, but this is the basic outline.
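Putting this together for the FTP case in the question, a minimal sketch might look like the following (wget.log and process.sh are placeholder names, and with -N the log can also contain "--" lines for listings or files wget checked but skipped, so you may need extra filtering):
#!/bin/bash
# Sketch: mirror the FTP folder, keep wget's log, then hand each
# retrieved URL to a (hypothetical) processing script.
wget -N -r -o wget.log ftp://server/folder

# Lines beginning with "--" mark each retrieved URL; field 4 is the URL.
grep -e '^--' wget.log | cut -d ' ' -f 4 | while read -r url; do
    ./process.sh "$url"   # placeholder for the real processing step
done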

Related

how do I download a large number of zip files with wget from a URL

At the URL here there are a large number of zip files that I need to download and save to the test/files/downloads directory. I'm using wget with the command
wget -i http://bitly.com/nuvi-plz -P test/files/downloads
and it downloads the whole page into a file inside the directory, then starts downloading each zip file but gives me a 404 for each one that looks something like
2016-05-12 17:12:28-- http://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://bitly.com/1462835080018.zip [following]
--2016-05-12 17:12:28-- https://bitly.com/1462835080018.zip
Connecting to bitly.com|69.58.188.33|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-05-12 17:12:29 ERROR 404: Not Found.
How can I get wget to download all the zip files on the page properly?
You need to follow the redirect from bit.ly and then download all the files. This is really ugly, but it worked:
wget http://bitly.com/nuvi-plz --server-response -O /dev/null 2>&1 | \
awk '(NR==1){SRC=$3;} /^  Location: /{DEST=$2} END{ print SRC, DEST}' | sed 's|.*http|http|' | \
while read -r url; do
    wget -A zip -r -l 1 -nd "$url" -P test/files/downloads
done
If you use the direct link, this will work:
wget -A zip -r -l 1 -nd http://feed.omgili.com/5Rh5AMTrc4Pv/mainstream/posts/ -P test/files/downloads
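Alternatively, a minimal sketch (not part of the original answer) is to let curl resolve the bit.ly redirect and hand the final URL to wget; -L follows redirects and -w '%{url_effective}' prints the URL curl ended up at:
# Sketch: resolve the short link first, then crawl the real page for zip files.
real_url=$(curl -Ls -o /dev/null -w '%{url_effective}' http://bitly.com/nuvi-plz)
wget -A zip -r -l 1 -nd "$real_url" -P test/files/downloads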

wget to parse a webpage in shell

I am trying to extract URLs from a webpage using wget. I tried this:
wget -r -l2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'
It only displays
FINISHED
Downloaded: 18,472 bytes in 1 files
but not the web links. If I do the two steps separately,
wget -r -l2 --reject=gif -O out.html www.google.com
sed -n 's/.*href="\([^"]*\).*/\1/p' < out.html
Output
http://www.google.com/intl/en/options/
/intl/en/policies/terms/
It is not displaying all the links, such as:
http://www.google.com
http://maps.google.com
https://play.google.com
http://www.youtube.com
http://news.google.com
https://mail.google.com
https://drive.google.com
http://www.google.com
http://www.google.com
http://www.google.com
https://www.google.com
https://plus.google.com
Moreover, I want to get links from the 2nd level and deeper. Can anyone give a solution for this?
Thanks in advance
The -O file option captures the output of wget and writes it to the specified file, so there is no output going through the pipe to sed.
You can say -O - to direct wget output to standard output.
If you don't want to use grep, you can try
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
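For instance, a minimal sketch of the whole thing as one pipeline (keeping in mind that -r combined with -O - concatenates everything wget fetches onto stdout, and 2>/dev/null hides wget's own log) would be:
wget -r -l2 --reject=gif -O - www.google.com 2>/dev/null | \
    sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"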

using wget to download a directory

I'm trying to download all the files in an online directory. The command I'm using is:
wget -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Using this command I get an empty directory. If I specify file names at the end I can get one at a time, but I'd like to get them all at once. Am I just missing something simple?
Output from the command:
--2015-03-14 14:54:05-- http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Resolving www.oecd-nea.org... 193.51.64.80
Connecting to www.oecd-nea.org|193.51.64.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dbforms/data/eva/evatapes/mendl_2/index.html’
Saving to: ‘robots.txt’
Add the depth of links you want to follow (-l1, since you only want to follow one link):
wget -e robots=off -l1 -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
I also added -e robots=off, since there is a robots.txt which would normally stop wget from crawling that directory. For reference, the other flags:
-r recursive,
-np no parent directories,
-nH no host-prefixed directories (wget's --no-host-directories)
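If the server's auto-generated listing also emits sort-order links (index.html?C=N;O=D and the like), a hedged variant widens the reject pattern so those copies are not kept either:
# Sketch: same command, but reject every file whose name starts with index.html.
wget -e robots=off -l1 -r -np -nH -R 'index.html*' http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/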

How to download all images from a website using wget?

Here is an example of my command:
wget -r -l 0 -np -t 1 -A jpg,jpeg,gif,png -nd --connect-timeout=10 -P ~/support --load-cookies cookies.txt "http://support.proboards.com/" -e robots=off
Based on the input here
But nothing really gets downloaded: there is no recursive crawling, and it takes just a few seconds to complete. I am trying to back up all the images from a forum. Is the forum structure causing issues?
wget -r -P /download/location -A jpg,jpeg,gif,png http://www.site.here
works like a charm
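If the forum serves its images from a different host (a CDN, for instance), plain recursion stays on the starting domain; a hedged variant that also spans a hypothetical image host would be:
# Sketch: -H allows spanning hosts, --domains limits which hosts are followed.
# images.example-cdn.com is a placeholder for the forum's actual image host.
wget -r -l 1 -H --domains=images.example-cdn.com,www.site.here \
    -A jpg,jpeg,gif,png -P /download/location http://www.site.here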
You can also download a file under a different name with -O.
Here I save it as wget.zip, as shown below.
# wget -O wget.zip http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
--2012-10-02 11:55:54-- http://ftp.gnu.org/gnu/wget/wget-1.5.3.tar.gz
Resolving ftp.gnu.org... 208.118.235.20, 2001:4830:134:3::b
Connecting to ftp.gnu.org|208.118.235.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 446966 (436K) [application/x-gzip]
Saving to: wget.zip
100%[===================================================================================>] 446,966 60.0K/s in 7.5s
2012-10-02 11:56:02 (58.5 KB/s) - wget.zip
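Note that -O only changes the local filename; the content is still the gzipped tarball, so (as a hypothetical follow-up) you would unpack it with tar rather than unzip:
tar -xzf wget.zip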

requesting data indefinitely in curl

I have a 200MB file to download. I don't want to download it directly by passing the URL to cURL (because my college blocks downloads larger than 150MB).
So I can download the data in 10MB chunks by passing range parameters to cURL, but I don't know how many 10MB chunks to download. Is there a way in cURL to download data indefinitely? Something like
while(next byte present)
download byte;
Thanks :)
Command-line curl lets you specify a range to download, so for your 150MB max you'd do something like
curl http://example.com/200_meg_file -r 0-104857599 > the_file
curl http://example.com/200_meg_file -r 104857600-209715199 >> the_file
and so on until the entire thing's downloaded, grabbing 100meg chunks at a time and appending each chunk to the local copy.
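To avoid guessing how many chunks there are, a minimal sketch (assuming the server reports Content-Length and honours range requests; the URL and filename are placeholders) can ask for the size first and then loop over 10MB ranges:
#!/bin/bash
# Sketch: read the total size from a HEAD request, then fetch the file
# in 10MB range requests, appending each piece to the local copy.
url="http://example.com/200_meg_file"   # placeholder URL
out="the_file"
chunk=$((10 * 1024 * 1024))

# Total size from the Content-Length header (assumes the server sends one).
size=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')

: > "$out"
start=0
while [ "$start" -lt "$size" ]; do
    end=$((start + chunk - 1))
    curl -s -r "$start-$end" "$url" >> "$out"
    start=$((end + 1))
done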
Curl already has the ability to resume a download. Just run like this:
$> curl -C - $url -o $output_file
Of course this won't figure out when to stop, per se. However it would be pretty easy to write a wrapper. Something like this:
#!/bin/bash
url="http://someurl/somefile"
out="outfile"
touch "$out"
last_size=-1
# Keep resuming until the output file stops growing between passes,
# i.e. curl -C - had nothing left to fetch.
while [ "`du -b "$out" | sed 's/\W.*//'`" -ne "$last_size" ]; do
    curl -C - "$url" -o "$out"
    last_size=`du -b "$out" | sed 's/\W.*//'`
done
I should note that curl outputs a fun looking error:
curl: (18) transfer closed with outstanding read data remaining
However I tested this on a rather large ISO file, and the md5 still matched up even though the above error was shown.
