Resume an aborted recursive download with wget without checking the dates for already downloaded files

The following command was aborted:
wget -w 10 -m -H "<URL>"
I would like to resume this download without checking the dates on the server for every file that I've already downloaded.
I'm using: GNU Wget 1.21.3 built on darwin18.7.0.
The following doesn't work for me: it still requests headers for every already-downloaded file at a rate of one request every 10 seconds (so as not to overwhelm the server), and while it doesn't re-download those files, the checking is very slow. 10 seconds times 80,000 files is a long time, and if the download aborts again after 300,000 files, resuming with this command will take even longer. In fact it takes as long as starting over, which I'd like to avoid.
wget -c -w 10 -m -H "<URL>"
The following is not recursive, because the first file already exists and is consequently not parsed for URLs, so nothing else is downloaded recursively.
wget -w 10 -r -nc -l inf --no-remove-listing -H "<URL>"
The result of this command is this:
File ‘<URL>’ already there; not retrieving.
The file that's "already there" contains links that should be followed, and if those files are "already there" then they too should not be retrieved. This process should continue until wget encounters files that haven't yet been downloaded.
I need to download 600,000 files without overwhelming the server and have already downloaded 80,000 files. wget should be able to zip through all the downloaded files really fast until it finds a missing file that it needs to download and then rate limit the downloads to 1 every 10 seconds.
I've read through the entire man page and can't find anything that looks like it will work except for what I have already tried. I don't care about the dates on the files, retrieving updated files, or downloading the rest of incomplete files. I only want to download files from the 600,000 that I haven't already downloaded without bogging down the server with unnecessary requests.

The file that's "already there" contains links that should be followed
If said file contains absolute links, then you might try using a combination of --force-html and -i file.html. Consider the following simple example; let the content of file.html be
<html>
<body>
<a href="https://www.example.com">Example</a>
<a href="https://www.duckduckgo.com">Search</a>
<a href="https://archive.org">Archive</a>
</body>
</html>
then
wget --force-html -i file.html -nc -r -l 1
creates the following structure:
file.html
www.example.com/index.html
www.duckduckgo.com/index.html
archive.org/index.html
archive.org/robots.txt
archive.org/index.html?noscript=true
archive.org/offshoot_assets/index.34c417fd1d63.css
archive.org/offshoot_assets/favicon.ico
archive.org/offshoot_assets/js/webpack-runtime.e618bedb4b40026e6d03.js
archive.org/offshoot_assets/js/index.60b02a82057240d1b68d.js
archive.org/offshoot_assets/vendor/lit@2.0.2/polyfill-support.js
archive.org/offshoot_assets/vendor/@webcomponents/webcomponentsjs@2.6.0/webcomponents-loader.js
and if you remove one of files, say archive.org/offshoot_assets/favicon.ico then subsequent run will download only that missing file.
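To reproduce that last point under the same assumptions as the example above (file.html and the mirrored directories in the current working directory), remove one file and re-run the exact same command; only the missing file should be fetched:

# Delete one already-downloaded file, then re-run the same command;
# everything that is still on disk is skipped locally.
rm archive.org/offshoot_assets/favicon.ico
wget --force-html -i file.html -nc -r -l 1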

Related

Trying curl in Windows to download latest avvdat-xxxxx.zip file

I am trying to automate downloading the latest McAfee DAT ZIP file from their repo using curl on a Windows server. The "avvdat-xxxxx.zip" file obviously changes every day. The actual current file as of today is "avvdat-10352.zip." The file name increments every day. As an example, I can download this in a roundabout way by running something like this:
curl -L -x "http://myproxyserver:80" "https://update.nai.com/products/commonupdater/avvdat-103[50-53].zip" -O
Obviously I would make the [50-53] range much larger to allow this to work over a longer period of time. The code above is just an example for the sake of brevity.
While this does download the intended ZIP file, it also creates a small 10-byte ZIP for each of the other files it cannot find. For instance, the above curl command creates these 4 files:
avvdat-10350.zip (10 bytes)
avvdat-10351.zip (10 bytes)
avvdat-10352.zip (110,665,596 bytes)
avvdat-10353.zip (10 bytes)
Is there any way to use curl such that it doesn't generate those small files? Or is there a better way to do this altogether? This would be pretty simple in a Linux bash script, but I'm not nearly as fluent with Windows batch/PowerShell scripting.
What you need is an HTML-parser like xidel:
xidel -s "https://update.nai.com/products/commonupdater/" -e "//pre/a/@href[matches(.,'avvdat-.*\.zip')]"
avvdat-10355.zip
xidel -s "https://update.nai.com/products/commonupdater/" -f "//pre/a/@href[matches(.,'avvdat-.*\.zip')]" --download .
The 2nd command downloads 'avvdat-10355.zip' to the current dir.
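If xidel isn't available, a rough bash sketch of the same idea might look like the following; it assumes a Unix-like shell on that Windows server (e.g. Git Bash or WSL), that the directory listing is plain HTML, and it reuses the proxy from the question:

# Scrape the listing for avvdat-*.zip names, take the last one
# (the numbering is assumed to be fixed-width, so a lexicographic
# sort is enough), then download just that file via the proxy.
base="https://update.nai.com/products/commonupdater/"
proxy="http://myproxyserver:80"
file=$(curl -sL -x "$proxy" "$base" | grep -oE 'avvdat-[0-9]+\.zip' | sort -u | tail -n 1)
curl -sL -x "$proxy" -o "$file" "$base$file"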

Considering a specific name for the downloaded file

I download a .tar.gz file with wget using this command:
wget hello.tar.gz
This is part of a long script. Sometimes, when I want to download this file, an error occurs, and when the file is downloaded a second time the name of the downloaded file changes to something like this:
hello.tar.gz.2
the third time:
hello.tar.gz.3
How can I ensure that, whatever the name of the downloaded file is, it is changed to hello.tar.gz?
In other words, I don't want the name of the downloaded file to be anything other than hello.tar.gz.
wget hello.tar.gz -O <fileName>
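For context, a hedged sketch of how -O fits into a script (the URL is a placeholder): because the output name is pinned, a re-run overwrites hello.tar.gz instead of producing hello.tar.gz.1, hello.tar.gz.2, and so on.

# -O pins the output name; on failure the partial file is removed so
# the next attempt starts clean (URL is a placeholder).
wget -O hello.tar.gz "https://example.com/hello.tar.gz" || rm -f hello.tar.gz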
wget has built-in options like -r and -p that change its default behavior, so just try the following:
wget -p <url>
wget -r <url>
Since you've noticed the incremental naming, discard any repeated files and rely on the following as the initial condition:
wget hello.tar.gz
mv hello.tar.gz.2 hello.tar.gz

consecutive numbered files download with wget bash with option to skip some files during download

There is a homepage where I can download zip files numbered from 1 to 10000. At the moment I'm downloading them with this command:
$ wget http://someaddress.com/somefolder/{001..10000}
I don't need all of them, but there is no logic to which zip files are required; I can only tell whether a file is needed once its download has already started. The unnecessary files are much bigger than the others, which increases the download time, so it would be great if I could somehow skip them. Is there any method in bash to do this?
You can use curl, which has a --max-filesize option and will not download files bigger than that limit. However, this depends on the website returning the correct size in a Content-Length header; you can check the headers with wget -S on one file to see whether it is provided. You will need to run curl once per URL, either with curl's own [001-10000] range globbing or with a shell for loop.
Alternatively, sticking with wget and assuming you don't have a Content-Length header, you could force a SIGPIPE when you receive too much data.
For example,
wget http://someaddress.com/somefolder/1234 -O - |
dd bs=1k count=2 >/tmp/1234
This gets wget to pipe the download into a dd command that copies the data through to the final file but stops after 2 blocks of 1024 bytes. If less data is received, the file will contain everything you want. If more data is received, dd stops, and when wget writes more to the pipe it is killed by a signal. You need to write a loop to do this for each URL.
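A sketch of the curl variant over the numbered range (the URL is from the question; the 2 MiB limit is a placeholder you would set just above the size of the files you do want):

# Skip any file whose advertised size exceeds ~2 MiB; curl aborts
# such transfers (exit code 63), and any leftover partial file is
# removed. Assumes the server sends Content-Length.
for i in {001..10000}; do
    curl -s --max-filesize 2097152 -o "$i" "http://someaddress.com/somefolder/$i" || rm -f "$i"
done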

wget: delete incomplete files

I'm currently using a bash script to download several images using wget.
Unfortunately the server I am downloading from is less than reliable, so sometimes while a file is downloading the server will disconnect and the script will move on to the next file, leaving the previous one incomplete.
To remedy this I've tried adding a second line after the script that re-fetches any incomplete files using:
wget -c myurl.com/image{1..3}.png
This seems to work, as wget goes back and completes the download of the files, but then a problem arises: ImageMagick, which I use to stitch the images into a PDF, claims there are errors in the headers of the images.
My idea for deleting the incomplete files is:
wget myurl.com/image{1..3}.png
wget -rmincompletefiles
wget -N myurl.com/image{1..3}.png
convert *.png mypdf.pdf
So the question is, what can I use in place of -rmincompletefiles that actually exists, or is there a better way I should be approaching this issue?
I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N, wget actually checks the file sizes and verifies they are the same. If they are not, the files are deleted and downloaded again.
So, cool tip if you're having the same issue I am!
I've found this solution to work for my use case.
From the answer:
wget http://www.example.com/mysql.zip -O mysql.zip || rm -f mysql.zip
This way, the file will only be deleted if an error or cancellation occurred.
Well, I would try hard to download the files with wget (you can specify extra parameters, like a larger --timeout, to give the server some extra time). wget assumes certain things about partial downloads, and even with proper resume they can sometimes end up mangled (unless you verify them by other means, e.g. by checking MD5 sums).
Since you are using convert and bash, there will most likely be another tool available from the ImageMagick package, namely identify.
While certain features are surely poorly documented, it has one awesome capability: it can identify broken (or partially downloaded) images.
➜ ~ identify b.jpg; echo $?
identify.im6: Invalid JPEG file structure: ...
1
It will return exit status 1 if you call it on an inconsistent image. You can remove these inconsistent images using a simple loop such as:
for i in *.png;
do identify "$i" || rm -f "$i";
done
Then I would try to download the broken files again.
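Putting the two ideas together, a rough sketch based on the question's example filenames (the range and URL are placeholders):

# Delete whatever identify reports as broken, re-fetch the missing or
# incomplete files, then build the PDF.
for i in image{1..3}.png; do
    identify "$i" >/dev/null 2>&1 || rm -f "$i"
done
wget -c myurl.com/image{1..3}.png
convert *.png mypdf.pdf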

WGET: Removing 'filename' since it should be rejected

I am trying to download all the wmv files that have the word 'high' in their name from a website, using wget with the following command:
wget -A "*high*.wmv" -r -H -l1 -nd -np -erobots=off http://mywebsite.com -O yl-`date +%H%M%S`.wmv
The file starts and finishes downloading, but just after it downloads I get:
Removing yl-120058.wmv since it should be rejected.
Why is that and how could I avoid it?
How could I make the command spider the whole website for those types of files automatically?
It's because the accept list is checked twice: once before downloading and once after saving. The latter is the behavior you see here ("it's not a bug, it's a feature"):
Your saved file yl-120058.wmv does not match the specified pattern -A "*high*.wmv" and will thus be rejected and deleted.
Quote from wget manual:
Finally, it's worth noting that the accept/reject lists are matched twice against downloaded files: [..] the local file's name is also checked against the accept/reject lists to see if it should be removed. [..] However, this can lead to unexpected results.
