Trying curl on Windows to download the latest avvdat-xxxxx.zip file

I am trying to automate downloading the latest McAfee DAT ZIP file from their repo using curl on a Windows server. The "avvdat-xxxxx.zip" file name increments every day; the actual current file as of today is "avvdat-10352.zip." As an example, I can download it in a roundabout way by running something like this:
curl -L -x "http://myproxyserver:80" "https://update.nai.com/products/commonupdater/avvdat-103[50-53].zip" -O
Obviously I would make the [50-53] range much larger to allow this to work over a longer period of time. The code above is just an example for the sake of brevity.
While this does download the intended ZIP file, it also creates a small 10-byte ZIP for each of the other files that it cannot find. For instance, the above curl command creates these 4 files:
avvdat-10350.zip (10 bytes)
avvdat-10351.zip (10 bytes)
avvdat-10352.zip (110,665,596 bytes)
avvdat-10353.zip (10 bytes)
Is there any way to use curl such that it doesn't generate those small files? Or is there a better way to do this altogether? This would be pretty simple in a Linux bash script, but I'm not nearly as fluent with Windows batch/PowerShell scripting.

What you need is an HTML parser like xidel:
xidel -s "https://update.nai.com/products/commonupdater/" -e "//pre/a/@href[matches(.,'avvdat-.*\.zip')]"
avvdat-10355.zip
xidel -s "https://update.nai.com/products/commonupdater/" -f "//pre/a/@href[matches(.,'avvdat-.*\.zip')]" --download .
The 2nd command downloads 'avvdat-10355.zip' to the current dir.
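If installing an extra tool on the Windows server is not an option, a curl-only variation is possible. This is a minimal sketch, assuming a POSIX shell (for example Git Bash or WSL) is available and reusing the proxy placeholder from the question: it scrapes the directory listing for the current avvdat-NNNNN.zip name and then downloads only that file.
BASE="https://update.nai.com/products/commonupdater"
PROXY="http://myproxyserver:80"
# Grab the index page and pull out the first avvdat-NNNNN.zip name it links to
DAT=$(curl -s -L -x "$PROXY" "$BASE/" | grep -o 'avvdat-[0-9]*\.zip' | head -n 1)
# Fetch the real file only if a name was found; -f makes curl fail on
# HTTP errors instead of saving the error page as a bogus small ZIP
[ -n "$DAT" ] && curl -L -f -x "$PROXY" -O "$BASE/$DAT"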

Related

Resume an aborted recursive download with wget without checking the dates for already downloaded files

The following command was aborted:
wget -w 10 -m -H "<URL>"
I would like to resume this download without checking the dates on the server for every file that I've already downloaded.
I'm using: GNU Wget 1.21.3 built on darwin18.7.0.
The following doesn't work for me: it keeps requesting headers at a rate of one every 10 seconds (so as not to overwhelm the server) and then doesn't re-download the files I already have, but the checking itself is very slow. 10 seconds times 80,000 files is a long time, and if it aborts again after 300,000 files, resuming with this command will take even longer. In fact it takes as long as starting over, which I'd like to avoid.
wget -c -w 10 -m -H "<URL>"
The following is not recursive: the first file already exists, so it is not parsed for URLs, and nothing else is downloaded recursively.
wget -w 10 -r -nc -l inf --no-remove-listing -H "<URL>"
The result of this command is this:
File ‘<URL>’ already there; not retrieving.
The file that's "already there" contains links that should be followed, and if those files are "already there" then they too should not be retrieved. This process should continue until wget encounters files that haven't yet been downloaded.
I need to download 600,000 files without overwhelming the server and have already downloaded 80,000 files. wget should be able to zip through all the downloaded files really fast until it finds a missing file that it needs to download and then rate limit the downloads to 1 every 10 seconds.
I've read through the entire man page and can't find anything that looks like it will work except for what I have already tried. I don't care about the dates on the files, retrieving updated files, or downloading the rest of incomplete files. I only want to download files from the 600,000 that I haven't already downloaded without bogging down the server with unnecessary requests.
The file that's "already there" contains links that should be followed
If said file contains absolute links then you might try using a combination of --force-html and -i file.html. Consider the following simple example; let the content of file.html be
<html>
<body>
<a href="https://www.example.com">Example</a>
<a href="https://www.duckduckgo.com">Search</a>
<a href="https://archive.org">Archive</a>
</body>
</html>
then
wget --force-html -i file.html -nc -r -l 1
creates the following structure:
file.html
www.example.com/index.html
www.duckduckgo.com/index.html
archive.org/index.html
archive.org/robots.txt
archive.org/index.html?noscript=true
archive.org/offshoot_assets/index.34c417fd1d63.css
archive.org/offshoot_assets/favicon.ico
archive.org/offshoot_assets/js/webpack-runtime.e618bedb4b40026e6d03.js
archive.org/offshoot_assets/js/index.60b02a82057240d1b68d.js
archive.org/offshoot_assets/vendor/lit@2.0.2/polyfill-support.js
archive.org/offshoot_assets/vendor/@webcomponents/webcomponentsjs@2.6.0/webcomponents-loader.js
and if you remove one of the files, say archive.org/offshoot_assets/favicon.ico, then a subsequent run will download only that missing file.
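Applied back to the interrupted mirror from the question, this might look roughly like the sketch below. Here seed.html is a hypothetical local copy of the already-downloaded start page (containing absolute links); -H and the 10-second wait are taken from the question.
# Parse the local seed page and recurse from it; with -nc, files already
# on disk are not requested again, so only the missing ones are fetched.
wget --force-html -i seed.html -nc -r -l inf -H -w 10 --no-remove-listing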

wget hangs after large file download

I'm trying to download a large file (5 GB) over FTP. Here is my script:
read ZipName
wget -c -N -q --show-progress "ftp://Password@ftp.server.com/$ZipName"
unzip $ZipName
The file downloads to 100% but the script never reaches the unzip command. There is no error message and no output in the terminal, just a blank new line. I have to press Ctrl+C and run the script again so it unzips, since wget then detects that the file is already fully downloaded.
Why does it hang like this? Is it because of the large file, or because of passing an argument to the command?
By the way, I can't use the ftp client because it isn't installed on the VM I'm working on, and it's a temporary VM, so I have no root privileges to install anything.
I've run some tests, and I think the size of the disk was the reason. I tried curl -O instead, and it worked with the same disk space available.
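For reference, a minimal sketch of that curl-based replacement, keeping the placeholder credentials and host from the original script (-C - is curl's counterpart to wget -c for resuming a partial download):
read ZipName
# Resume a partial download (-C -) and keep the remote file name (-O)
curl -C - -O "ftp://Password@ftp.server.com/$ZipName"
unzip "$ZipName"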

Remote bash script and executing the make command

I have a device installed remotely that has Internet access. As I cannot SSH directly to it, the device downloads updates from a .txt file located on a server. This .txt file is interpreted by the device as a sequence of bash instructions.
Now I'm planning an update that requires re-compiling a C program on the device after downloading and overwriting some files. The content of the .txt file for this update looks like:
#!/bin/bash
curl -o /path-to-file/newfile.c http://myserver/newfile.c
sleep 10 #wait so it can download
cd /path-to-file
make
sleep 10 #wait to make
sudo /path-to-file/my-program #test if it worked
I previously tested this method and it worked as expected, but I never tested make. A couple of questions:
Should it work?
Is the sleep after make necessary?
Here is an example of how to retrieve a source code file into another directory, change to that directory, compile the source code with make and then execute the resulting binary:
mkdir -p path-to-file/
curl -o path-to-file/newfile.c http://www.csit.parkland.edu/~cgraff1/src/hello_world.c
cd path-to-file/
make newfile
./newfile
The cd is really not an integral part of the process, but it seems as if the question specifically pertains to performing the work in a directory other than the present working directory.
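Carried back to the update .txt from the question, a hedged sketch could look like this. The paths, server, and program names are the question's own placeholders; since curl, cd, and make all run synchronously, the sleep lines are not needed to wait for them.
#!/bin/bash
# Fetch the new source file; curl returns only after the download finishes
curl -o /path-to-file/newfile.c http://myserver/newfile.c
# Build in the target directory; stop if the directory is missing
cd /path-to-file || exit 1
# make blocks until the build completes, so the program can run right after
make && sudo /path-to-file/my-program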

Consecutive numbered file downloads with wget/bash, with an option to skip some files during download

There is a homepage where I can download zip files numbered from 1 to 10000. At the moment I'm downloading them with this command:
$ wget http://someaddress.com/somefolder/{001..10000}
I don't need all of them, but there is no logic to which ZIP files are required; I can only see whether a file is needed once the download has already started. The unnecessary files are much bigger than the others, which increases the download time, so it would be great if I could somehow skip them. Is there any method in bash to do this?
You can use curl, which has a --max-filesize option and will not download files bigger than that limit. However, it depends on your website returning the correct size in a Content-Length header; you can check the headers with wget -S on one file to see if it is provided. curl does not expand URL patterns, so you will have to write a shell for loop over the URLs.
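For example, a hedged sketch of such a loop, reusing the question's URL pattern (the 2 MB cut-off is only illustrative; curl aborts with exit code 63 when --max-filesize is exceeded):
for i in {001..10000}; do
  # Abandon anything larger than roughly 2 MB and drop any partial output
  curl -sS --max-filesize 2000000 -o "$i" "http://someaddress.com/somefolder/$i" || rm -f "$i"
done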
Alternatively, sticking with wget and assuming you don't have a Content-Length, you could force a SIGPIPE when you receive too much data.
For example,
wget http://someaddress.com/somefolder/1234 -O - |
dd bs=1k count=2 >/tmp/1234
This gets wget to pipe the download into a dd command that copies the data through to the final file but stops after 2 blocks of 1024 bytes. If less data is received, the file will contain everything you want. If more data is received, dd will stop, and when wget writes more to the pipe it will be stopped by a signal. You need to write a loop to do this for each URL.
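A sketch of that loop, again reusing the question's URL pattern and the illustrative 2 x 1024-byte cut-off:
for i in {001..10000}; do
  # dd stops after 2 KiB; if the server sends more, wget is killed by SIGPIPE
  wget -q "http://someaddress.com/somefolder/$i" -O - | dd bs=1k count=2 of="/tmp/$i" 2>/dev/null
done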

wget: delete incomplete files

I'm currently using a bash script to download several images using wget.
Unfortunately the server I am downloading from is less than reliable and therefore sometimes when I'm downloading a file, the server will disconnect and the script will move onto the next file, leaving the previous one incomplete.
In order to remedy this, I've tried to add a second line after the script, which fetches all incomplete files using:
wget -c myurl.com/image{1..3}.png
This seems to work, as wget goes back and completes the download of the files, but the problem then is this: ImageMagick, which I use to stitch the images into a PDF, claims there are errors with the headers of the images.
My thought on how to delete the incomplete files is:
wget myurl.com/image{1..3}.png
wget -rmincompletefiles
wget -N myurl.com/image{1..3}.png
convert *.png mypdf.pdf
So the question is, what can I use in place of -rmincompletefiles that actually exists, or is there a better way I should be approaching this issue?
I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N, wget actually checks the file sizes and verifies they are the same. If they are not, the files are deleted and then downloaded again.
So that's a cool tip if you're having the same issue I am!
I've found this solution to work for my use case.
From the answer:
wget http://www.example.com/mysql.zip -O mysql.zip || rm -f mysql.zip
This way, the file will only be deleted if an error or cancellation occurred.
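Applied to the image set from the question, a hedged sketch of that pattern would be:
for i in {1..3}; do
  # Delete the output if wget fails or is interrupted part-way through
  wget "myurl.com/image$i.png" -O "image$i.png" || rm -f "image$i.png"
done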
Well, I would try hard to download the files with wget (you can specify extra parameters like a larger --timeout to give the server some extra time). wget assumes certain things about partial downloads, and even with proper resume they can sometimes end up mangled (unless you check their MD5 sums, for example, by other means).
Since you are using convert and bash, there will most likely be another tool available from the ImageMagick package, namely identify.
While certain features are surely poorly documented, it has one awesome capability: it can identify broken (or partially downloaded) images.
➜ ~ identify b.jpg; echo $?
identify.im6: Invalid JPEG file structure: ...
1
It will return exit status 1 if you call it on an inconsistent image. You can remove these inconsistent images using a simple loop such as:
for i in *.png;
do identify "$i" || rm -f "$i";
done
Then I would try to download the broken files again.
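A short sketch of that final pass, reusing the question's URL pattern: with -nc, only the images that the identify loop just removed are fetched again.
wget -nc myurl.com/image{1..3}.png
convert *.png mypdf.pdf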
