WGET: Removing 'filename' since it should be rejected - shell

I am trying to download all the wmv files that have the word 'high' in their name from a website, using wget with the following command:
wget -A "*high*.wmv" -r -H -l1 -nd -np -erobots=off http://mywebsite.com -O yl-`date +%H%M%S`.wmv
The file starts and finishes downloading, but just after it finishes I get
Removing yl-120058.wmv since it should be rejected.
Why is that and how could I avoid it?
How could I make the command spider the whole website for those types of files automatically?

It's because the accept/reject list is checked twice: once before downloading and once after saving. The latter is the behaviour you see here ("it's not a bug, it's a feature"):
Your saved file yl-120058.wmv does not match your specified pattern -A "*high*.wmv" and will thus be rejected and deleted.
Quote from wget manual:
Finally, it's worth noting that the accept/reject lists are matched twice against downloaded files: [..] the local file's name is also checked against the accept/reject lists to see if it should be removed. [..] However, this can lead to unexpected results.
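One way to avoid the post-download rejection, as a sketch rather than the only option: drop -O so wget saves the files under their original names (which still match the accept pattern), then rename them afterwards. The yl- prefix and timestamp below simply mirror the question's command.
# Keep the original names so they still match -A "*high*.wmv",
# then rename afterwards; the prefix/timestamp mirror the question.
wget -r -H -l1 -nd -np -erobots=off -A "*high*.wmv" http://mywebsite.com
for f in *high*.wmv; do
    mv "$f" "yl-$(date +%H%M%S)-$f"
done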

Resume an aborted recursive download with wget without checking the dates for already downloaded files

The following command was aborted:
wget -w 10 -m -H "<URL>"
I would like to resume this download without checking the dates on the server for every file that I've already downloaded.
I'm using: GNU Wget 1.21.3 built on darwin18.7.0.
The following doesn't work for me because it keeps requesting headers at a rate of one every 10 seconds (to avoid overwhelming the server) and doesn't re-download the files, but the checking itself is very slow. 10 seconds times 80,000 files is a long time, and if it aborts again after 300,000 files, resuming with this command will take even longer. In fact it takes as long as starting over, which I'd like to avoid.
wget -c -w 10 -m -H "<URL>"
The following is not recursive, because the first file already exists and is consequently not parsed for URLs, so nothing else is downloaded recursively.
wget -w 10 -r -nc -l inf --no-remove-listing -H "<URL>"
The result of this command is this:
File ‘<URL>’ already there; not retrieving.
The file that's "already there" contains links that should be followed, and if those files are "already there" then they too should not be retrieved. This process should continue until wget encounters files that haven't yet been downloaded.
I need to download 600,000 files without overwhelming the server and have already downloaded 80,000 files. wget should be able to zip through all the downloaded files really fast until it finds a missing file that it needs to download and then rate limit the downloads to 1 every 10 seconds.
I've read through the entire man page and can't find anything that looks like it will work except for what I have already tried. I don't care about the dates on the files, retrieving updated files, or downloading the rest of incomplete files. I only want to download files from the 600,000 that I haven't already downloaded without bogging down the server with unnecessary requests.
The file that's "already there" contains links that should be followed
If said file contains absolute links, then you might try using a combination of --force-html and -i file.html. Consider the following simple example; let the content of file.html be
<html>
<body>
<a href="https://www.example.com">Example</a>
<a href="https://www.duckduckgo.com">Search</a>
<a href="https://archive.org">Archive</a>
</body>
</html>
then
wget --force-html -i file.html -nc -r -l 1
creates the following structure:
file.html
www.example.com/index.html
www.duckduckgo.com/index.html
archive.org/index.html
archive.org/robots.txt
archive.org/index.html?noscript=true
archive.org/offshoot_assets/index.34c417fd1d63.css
archive.org/offshoot_assets/favicon.ico
archive.org/offshoot_assets/js/webpack-runtime.e618bedb4b40026e6d03.js
archive.org/offshoot_assets/js/index.60b02a82057240d1b68d.js
archive.org/offshoot_assets/vendor/lit@2.0.2/polyfill-support.js
archive.org/offshoot_assets/vendor/@webcomponents/webcomponentsjs@2.6.0/webcomponents-loader.js
and if you remove one of the files, say archive.org/offshoot_assets/favicon.ico, then a subsequent run will download only that missing file.
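Applied back to the question, a rough and untested sketch would be to feed the already-downloaded first page to wget together with the original mirroring and rate-limiting flags (start-page.html is a hypothetical name standing in for wherever that first page was saved locally):
# start-page.html stands in for the locally saved copy of "<URL>";
# -nc keeps the 80,000 existing files, --force-html/-i re-parses the seed page
wget --force-html -i start-page.html -nc -r -l inf -w 10 -H --no-remove-listing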

Considering a specific name for the downloaded file

I download a .tar.gz file with wget using this command:
wget hello.tar.gz
This is part of a long script. Sometimes when I want to download this file an error occurs, and when the file is downloaded for the second time its name changes to something like this:
hello.tar.gz.2
the third time:
hello.tar.gz.3
How can I ensure that, whatever the name of the downloaded file is, it gets changed to hello.tar.gz?
In other words, I don't want the name of the downloaded file to be anything other than hello.tar.gz.
wget hello.tar.gz -O <fileName>
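For example, a minimal sketch (the URL is only a placeholder for wherever hello.tar.gz actually lives); with -O the local name is fixed, so repeated runs never produce hello.tar.gz.1 or hello.tar.gz.2:
# Placeholder URL; -O forces the same local filename on every attempt
wget -O hello.tar.gz "http://example.com/path/hello.tar.gz"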
wget has internal options like -r and -p that change its default behaviour.
So just try the following:
wget -p <url>
wget -r <url>
Since you have now noticed the incremental naming, discard any repeated files and rely on the following as the initial condition:
wget hello.tar.gz
mv hello.tar.gz.2 hello.tar.gz

consecutive numbered files download with wget bash with option to skip some files during download

There is a homepage where I can download zip files numbered from 1 to 10000. At the moment I'm downloading them with this command:
$ wget http://someaddress.com/somefolder/{001..10000}
I don't need all of them, but there is no logic to which zip files are required. I can only see whether a file is needed once its download has already started. The unnecessary files are much bigger than the others, which increases the download time, so it would be great if I could somehow skip them. Is there any method in bash to do this?
You can use curl, which has a --max-filesize option and will not download files bigger than that. However, it depends on your website returning the correct size in a Content-Length header. You can check the headers with wget -S on one file to see whether they are provided. curl does not do URL patterns, so you will have to write a shell for loop over the URLs.
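A sketch of such a loop (the 2048-byte cap is only illustrative; pick a limit just above the size of the files you do want to keep):
# --max-filesize makes curl skip any file whose reported size exceeds the limit
for i in {001..10000}; do
    curl -s --max-filesize 2048 -o "$i" "http://someaddress.com/somefolder/$i"
done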
Alternatively, sticking with wget and assuming you don't have a Content-Length header, you could force a SIGPIPE when you receive too much data.
For example,
wget http://someaddress.com/somefolder/1234 -O - |
dd bs=1k count=2 >/tmp/1234
This gets wget to pipe the download into a dd command that copies the data through to the final file but stops after 2 blocks of 1024 bytes. If less data is received, the file will contain everything you want. If more data is received, dd will stop, and when wget writes more to the pipe it will be stopped by a signal.
You need to write a loop to do this for each url.
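A rough sketch of that loop, reusing the 2-block cutoff from the example above (anything larger than 2 KiB arrives truncated, which marks it as one of the unwanted files):
# wget streams each file to stdout; dd keeps at most 2 x 1024 bytes and the
# resulting broken pipe stops wget; dd's own statistics go to /dev/null
for i in {001..10000}; do
    wget -q "http://someaddress.com/somefolder/$i" -O - |
        dd bs=1k count=2 > "/tmp/$i" 2>/dev/null
done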

wget: delete incomplete files

I'm currently using a bash script to download several images using wget.
Unfortunately the server I am downloading from is less than reliable, so sometimes while I'm downloading a file the server disconnects and the script moves on to the next file, leaving the previous one incomplete.
To remedy this I've tried to add a second line after the script, which fetches all the incomplete files using:
wget -c myurl.com/image{1..3}.png
This seems to work, as wget goes back and completes the download of the files, but a problem then arises: ImageMagick, which I use to stitch the images into a PDF, claims there are errors in the headers of the images.
My idea for deleting the incomplete files is something like:
wget myurl.com/image{1..3}.png
wget -rmincompletefiles
wget -N myurl.com/image{1..3}.png
convert *.png mypdf.pdf
So the question is: what can I use in place of -rmincompletefiles that actually exists, or is there a better way I should be approaching this issue?
I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N, wget actually checks the file sizes and verifies that they are the same. If they are not, the files are deleted and then downloaded again.
So that's a cool tip if you're having the same issue I am!
I've found this solution to work for my use case.
From the answer:
wget http://www.example.com/mysql.zip -O mysql.zip || rm -f mysql.zip
This way, the file will only be deleted if an error or cancellation occurred.
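Adapted to the image URLs from the question, a sketch might look like this (any non-zero wget exit removes the partial image before the script moves on):
# A disconnect or timeout makes wget exit non-zero, which deletes the partial file
for i in 1 2 3; do
    wget "myurl.com/image$i.png" -O "image$i.png" || rm -f "image$i.png"
done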
Well, I would try hard to download the files with wget (you can specify extra parameters, like a larger --timeout, to give the server some extra time). wget assumes certain things about partial downloads, and even with a proper resume they can sometimes end up mangled (unless you check, e.g., their MD5 sums by other means).
Since you are using convert and bash, there will most likely be another tool available from the ImageMagick package, namely identify.
While certain features are surely poorly documented, it has one awesome capability: it can identify broken (or partially downloaded) images.
➜ ~ identify b.jpg; echo $?
identify.im6: Invalid JPEG file structure: ...
1
It will return exit status 1 if you call it on an inconsistent image. You can remove these inconsistent images using a simple loop such as:
for i in *.png;
do identify "$i" || rm -f "$i";
done
Then I would try to download the broken files again.
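Putting the pieces together for the question's files, one possible cleanup-and-retry pass (a sketch; -nc is added so images that are already complete are not fetched again):
# Delete images identify cannot parse, then fetch only the missing ones again
for i in *.png; do
    identify "$i" > /dev/null 2>&1 || rm -f "$i"
done
wget -nc myurl.com/image{1..3}.png
convert *.png mypdf.pdf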

bash script wget download files by date

I'm new to the world of bash scripting. Hoping to seek some help here.
Been messing about with the 'wget' command and found that it is quite neat! At the moment, it gets all contents from an https site, including all directories, and saves them all accordingly. Here's the command:
wget -r -nH --cut-dirs=1 -R index.html -P /home/snoiniM/data/in/ https://www.someWebSite.com/folder/level2 --user=someUserName --password=P@ssword
/home/snoiniM/data/in/folder/level2/level2-2013-07-01.zip saved
/home/snoiniM/data/in/folder/level2/level2-2013-07-02.zip saved
/home/snoiniM/data/in/folder/level2/level2-2013-07-03.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-01.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-02.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-03.zip saved
That is fine for all intents and purposes. But what if I really just want to get a specific date from all its directories? E.g. just levelx-2013-07-03.zip from all the dirs within folder, saving everything to one local directory (e.g. all the *.zip files end up in ...folder/).
Does anyone know how to do this?
I found that dropping --cut-dirs=1 and using the URL www.someWebsite.com/folder/ is sufficient.
Also, with that in mind, I added the -nd option. This means no directories: "Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering."
This means we're left with one more part: how to write a bash script that gets yesterday's date and passes it to the wget command as a parameter.
E.g.
wget -r -nH -nd -R index.html -A "*$yesterday.zip" -P /home/snoiniM/data/in/ https://www.someWebSite.com/folder/ --user=someUserName --password=P@ssword
Just the snippet you are looking for:
yesterday=$(date --date="@$(($(date +%s)-86400))" +%Y-%m-%d)
And there's no need for the * before $yesterday; -A will treat the pattern as a suffix.
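Putting it together with the question's command, a sketch of the full invocation might look like this (paths, host and credentials are the question's own placeholders):
# Yesterday's date as YYYY-MM-DD, then fetch only the matching zips flat into one directory
yesterday=$(date --date="@$(($(date +%s)-86400))" +%Y-%m-%d)
wget -r -nH -nd -R index.html -A "$yesterday.zip" -P /home/snoiniM/data/in/ \
     --user=someUserName --password='P@ssword' \
     https://www.someWebSite.com/folder/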
