wget: delete incomplete files - bash

I'm currently using a bash script to download several images using wget.
Unfortunately the server I am downloading from is less than reliable and therefore sometimes when I'm downloading a file, the server will disconnect and the script will move onto the next file, leaving the previous one incomplete.
In order to remedy this I've tried to add a second line after the downloads, which goes back and fetches all the incomplete files using:
wget -c myurl.com/image{1..3}.png
This seems to work, as wget goes back and completes the download of the files, but then a new problem appears: ImageMagick, which I use to stitch the images into a PDF, claims there are errors in the headers of the images.
My thought on how to delete the incomplete files is:
wget myurl.com/image{1..3}.png
wget -rmincompletefiles
wget -N myurl.com/image{1..3}.png
convert *.png mypdf.pdf
So the question is: what can I use in place of -rmincompletefiles that actually exists, or is there a better way I should be approaching this issue?

I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N, wget actually checks the file sizes and verifies they are the same. If they are not, the files are deleted and then downloaded again.
So that's a cool tip if you're having the same issue I am!
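In other words, the pseudo-script above could be reduced to something like this sketch, with -N standing in for the non-existent -rmincompletefiles step (based purely on the observation above, and untested against a genuinely flaky server):
wget myurl.com/image{1..3}.png
wget -N myurl.com/image{1..3}.png   # size/timestamp mismatch => wget deletes and re-downloads the file
convert *.png mypdf.pdf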

I've found this solution to work for my use case.
From the answer:
wget http://www.example.com/mysql.zip -O mysql.zip || rm -f mysql.zip
This way, the file will only be deleted if an error or cancellation occurred.
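Applied to the multi-image case from the question, the same pattern would need a per-file loop so that a failure only removes the file it affects; a rough sketch reusing the question's example URLs (untested):
for i in 1 2 3; do
    f="image$i.png"
    wget "myurl.com/$f" -O "$f" || rm -f "$f"   # delete only the file whose download failed
done
convert *.png mypdf.pdf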

Well, I would try hard to download the files with wget (you can specify extra parameters like a larger --timeout to give the server some extra time). wget assumes certain things about partial downloads, and even with proper resume they can sometimes end up mangled (unless you check e.g. their MD5 sums by other means).
Since you are using convert and bash, there will most likely be another tool available from the ImageMagick package - namely identify.
While certain features are surely poorly documented, it has one awesome piece of functionality - it can identify broken (or partially downloaded) images.
➜ ~ identify b.jpg; echo $?
identify.im6: Invalid JPEG file structure: ...
1
It will return exit status 1 if you call it on an inconsistent image. You can remove these inconsistent images using a simple loop such as:
for i in *.png; do
    identify "$i" || rm -f "$i"
done
Then I would try to re-download the files that were broken.
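Putting those two steps together, a hedged sketch for the question's URLs might look like this (it assumes the broken files are simply gone after the cleanup pass, so --no-clobber fetches only those):
for i in *.png; do
    identify "$i" > /dev/null 2>&1 || rm -f "$i"   # drop anything ImageMagick cannot parse
done
wget -nc myurl.com/image{1..3}.png   # -nc: only the now-missing files are fetched again
convert *.png mypdf.pdf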

Related

Resume an aborted recursive download with wget without checking the dates for already downloaded files

The following command was aborted:
wget -w 10 -m -H "<URL>"
I would like to resume this download without checking the dates on the server for every file that I've already downloaded.
I'm using: GNU Wget 1.21.3 built on darwin18.7.0.
The following doesn't work for me: it keeps requesting headers at a rate of one every 10 seconds (so as not to overwhelm the server) and then doesn't download the files, so the checking is very slow. 10 seconds times 80,000 files is a long time, and if it aborts again after 300,000 files, resuming with this command will take even longer. In fact it takes as long as starting over, which I'd like to avoid.
wget -c -w 10 -m -H "<URL>"
The following is not recursive, because the first file already exists and consequently is not parsed for URLs to recursively download everything else.
wget -w 10 -r -nc -l inf --no-remove-listing -H "<URL>"
The result of this command is this:
File ‘<URL>’ already there; not retrieving.
The file that's "already there" contains links that should be followed, and if those files are "already there" then they too should not be retrieved. This process should continue until wget encounters files that haven't yet been downloaded.
I need to download 600,000 files without overwhelming the server and have already downloaded 80,000 files. wget should be able to zip through all the downloaded files really fast until it finds a missing file that it needs to download and then rate limit the downloads to 1 every 10 seconds.
I've read through the entire man page and can't find anything that looks like it will work except for what I have already tried. I don't care about the dates on the files, retrieving updated files, or downloading the rest of incomplete files. I only want to download files from the 600,000 that I haven't already downloaded without bogging down the server with unnecessary requests.
The file that's "already there" contains links that should be followed
If said file contains absolute links, then you might try using a combination of --force-html and -i file.html. Consider the following simple example; let the content of file.html be
<html>
<body>
<a href="https://www.example.com">Example</a>
<a href="https://www.duckduckgo.com">Search</a>
<a href="https://archive.org">Archive</a>
</body>
</html>
then
wget --force-html -i file.html -nc -r -l 1
creates the following structure
file.html
www.example.com/index.html
www.duckduckgo.com/index.html
archive.org/index.html
archive.org/robots.txt
archive.org/index.html?noscript=true
archive.org/offshoot_assets/index.34c417fd1d63.css
archive.org/offshoot_assets/favicon.ico
archive.org/offshoot_assets/js/webpack-runtime.e618bedb4b40026e6d03.js
archive.org/offshoot_assets/js/index.60b02a82057240d1b68d.js
archive.org/offshoot_assets/vendor/lit@2.0.2/polyfill-support.js
archive.org/offshoot_assets/vendor/@webcomponents/webcomponentsjs@2.6.0/webcomponents-loader.js
and if you remove one of the files, say archive.org/offshoot_assets/favicon.ico, then a subsequent run will download only that missing file.
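Adapted to the original command, a hedged sketch might feed the already-downloaded start page back in; here start_page.html is a placeholder for whatever wget saved on the first run, the -w/-H options are carried over from the question, and this is untested at the 600,000-file scale:
wget --force-html -i start_page.html -nc -r -l inf -w 10 -H --no-remove-listing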

Is it possible to use a wildcard in a directory name in wget for ftp?

I'm sorry if this has been answered already, but I couldn't find the solution to my problem.
I need to download files from an ftp server that are inside a structure that looks like this:
ftp://ftp.some.adress/one/first_wildcard/two/second_wildcard/three/
I tried using * as a wildcard but it's not working:
wget -r -np -nd --accept='myfile*.ext' ftp://ftp.some.adress/one/*/two/*/three/
When I try this I get the error
ERROR 404: Not Found.
This doesn't happen if I specify the exact names for the directories. If I put the entire path, the file downloads without a problem, which is why I'm guessing the problem is in the *. Needless to say, I don't want to specify the names one by one, since there are many directories.
I also tried adding --glob=on (because I have no idea what I'm doing) but the result is the same.
In case you need to see the ftp, here's the one I'm working with. User: anonymous, no password.
ftp://ftp.cccma.ec.gc.ca/data/climdex/CMIP5/historical/
and this works for me:
wget -r -np -nd --accept='prcptot*.nc' ftp://ftp.cccma.ec.gc.ca/data/climdex/CMIP5/historical/ACCESS1-0/r1i1p1/v1/base_1961-1990/ --glob=on
And the wildcards should be in place of ACCESS1-0 and v1, i.e. doing the same thing for every directory available at those levels.
Any suggestion to make this work or to achieve the same goal with another tool (ftp, lftp, something else)?
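For what it's worth, wget's FTP globbing appears to apply only to the file-name portion of the URL, not to intermediate directories, so one possible workaround is to enumerate the wildcard levels first and loop over them. A rough, untested sketch against the layout above (the r1i1p1 and base_1961-1990 components are copied from the working example and may vary):
base="ftp://ftp.cccma.ec.gc.ca/data/climdex/CMIP5/historical"
for model in $(curl -s --list-only "$base/"); do                  # first wildcard level
    for ver in $(curl -s --list-only "$base/$model/r1i1p1/"); do  # second wildcard level
        wget -r -np -nd --accept='prcptot*.nc' "$base/$model/r1i1p1/$ver/base_1961-1990/"
    done
done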

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tail of the other options I've read about.
I wish to access a database file hosted as follows (i.e. hhsuite_dbs is a folder containing several databases):
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, and so I want to download the latest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question then is: how do I go about retrieving this (and, for a little more finesse, I'd perhaps like to be able to check whether the remote file has changed in name or content, to avoid a large download if it's unnecessary)? Is the best approach to query the name of the file, or its last-modified date (given that they may change the naming syntax of the file too)? To my naive brain, some kind of globbing on pdb70 (something I think I can rely on to be in the filename), then pulling it down with wget, was all I had come up with so far.
EDIT Another confounding issue that has just occurred to me is that the file I want won't necessarily be the newest in the folder (as there are other types of databases there too); rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp and curlftpls, but all of these seem to require logins/passwords for the server, which I don't have/need if I were to just download it via the web. I've also seen mention of rsync, but on a cursory read it seems like people are steering clear of it for FTP use.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. What you're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
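A hedged alternative that avoids parsing ls output (assuming GNU find, sort and cut are available) is to sort on the raw modification timestamp instead:
newestfile=$(find /some/place/safe -maxdepth 1 -name 'pdb70*gz' -printf '%T@ %p\n' | sort -n | tail -n 1 | cut -d' ' -f2-)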
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
    echo "Difference since last check:"
    diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
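For example, with diff's default output format, lines that are only in the new listing are prefixed with "> ", so a hedged one-liner to pull out the newly added names could be:
newfiles=$(diff /tmp/filelist.txt /tmp/filelist.txt.$$ | sed -n 's/^> //p')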
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. The nice thing about --mirror is that it won't download files that are already on hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

WGET: Removing 'filename' since it should be rejected

I am trying to download all the wmv files that have the word 'high' on their name, in a website using wget with the following command:
wget -A "*high*.wmv" -r -H -l1 -nd -np -erobots=off http://mywebsite.com -O yl-`date +%H%M%S`.wmv
The file starts and finishes downloading but just after it downloads I get
Removing yl-120058.wmv since it should be rejected.
Why is that and how could I avoid it?
How could I make the command spider the whole website for those types of files automatically?
It's because the accept list is being checked twice, once before downloading and once after saving. The latter is the behavior you see here ("it's not a bug, it's a feature"):
Your saved file yl-120058.wmv does not match your specified pattern -A "*high*.wmv" and will thus be rejected and deleted.
Quote from wget manual:
Finally, it's worth noting that the accept/reject lists are matched twice against downloaded files: [..] the local file's name is also checked against the accept/reject lists to see if it should be removed. [..] However, this can lead to unexpected results.
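As a hedged workaround (not part of the original answer): since it is the -O rename that makes the saved name fail the -A check, you could let wget keep the remote filenames and rename them afterwards, for instance:
wget -A "*high*.wmv" -r -H -l1 -nd -np -erobots=off http://mywebsite.com
for f in *high*.wmv; do
    mv "$f" "yl-$(date +%H%M%S)-$f"   # hypothetical local naming scheme; adjust as needed
done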

How to resume an ftp download at any point? (shell script, wget option)?

I want to download a huge file from an ftp server in chunks of 50-100MB each. At each point, I want to be able to set the "starting" point and the length of the chunk I want. I won't have the "previous" chunks saved locally (i.e. I can't ask the program to "resume" the download).
What is the best way of going about that? I use wget mostly, but would something else be better?
I'm really interested in a pre-built/built-in function rather than using a library for this purpose... Since wget (and ftp too, I think) allows resumption of downloads, I don't see why that would be a problem... (I can't figure it out from all the options though!)
I don't want to keep the entire huge file at my end, just process it in chunks... fyi all - I'm having a look at continue FTP download afther reconnect which seems interesting..
Use wget with:
-c option
Extracted from man pages:
-c / --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program. For instance:
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named ls-lR.Z in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
For those who'd like to use command-line curl, here goes:
curl -u user:passwd -C - -o <partial_downloaded_file> ftp://<ftp_path>
(leave out -u user:pass for anonymous access)
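If you really need an arbitrary 50-100MB chunk rather than resuming from the size of a local partial file, curl's --range option takes explicit byte offsets; a hedged sketch with placeholder offsets for the second 100 MiB of the file:
curl -u user:passwd --range 104857600-209715199 -o chunk2.part ftp://<ftp_path>   # bytes 100 MiB .. 200 MiB - 1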
I'd recommend interfacing with libcurl from the language of your choice.
