Downloading Only Newest File Using Wget / Curl - bash

How would I use wget or curl to download the newest file in a directory?
This seems really easy, however the filename won't always be predictable, and as new data comes in it'll be replaced with a random filename.
Specifically, the directory I wish to download data from has the following naming structure, where the last string of characters is a randomly generated timestamp:
MRMS_RotationTrackML1440min_00.50_20160530-175837.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-182639.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz
The randomly generated timestamp is in the format of: {hour}{minute}{second}
The directory in question is here: http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
Could it be something in the headers, where you'd use curl to sift through the Last-Modified timestamp?
Any help would be appreciated here, thanks in advance.

You can just run the following command periodically:
wget -r -nc --level=1 http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
It will recursively download whatever is new in the directory since the last run.
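If you only want the single newest file rather than an ever-growing mirror, one approach is to scrape the index page, sort the names, and fetch the last one. This is a sketch that assumes the server returns an Apache-style HTML index; the HTML snippet below stands in for the real curl output, and the final wget line is what you would actually run:

```shell
# Stand-in for: listing=$(curl -s "http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/")
listing='<a href="MRMS_RotationTrackML1440min_00.50_20160530-175837.grib2.gz">f1</a>
<a href="MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz">f2</a>'

# The names embed a YYYYMMDD-HHMMSS timestamp, so a plain sort puts the newest last
newest=$(printf '%s\n' "$listing" | grep -o 'MRMS_[^"<]*\.grib2\.gz' | sort -u | tail -n 1)
echo "$newest"
# → MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz

# then: wget "http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/$newest"
```

Because the timestamp sorts lexically, no date parsing is needed; if the directory ever mixes prefixes, you'd need a stricter pattern.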

Related

WGET saves with wrong file and extension name possibly due to BASH

I've tried this on a few forum threads already, but I keep getting the same failure.
To replicate the problem:
Here is a URL leading to a forum thread with 6 pages.
http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1
What I typed into the console was:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"
And here is what I got:
--2018-06-14 10:44:17-- http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '1'
1 [ <=> ] 19.50K 58.7KB/s in 0.3s
2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]
The file was saved as simply "1", with no extension it seems.
My expectation was that the file would be saved with an .html extension, because it's a webpage.
I'm trying to get WGET to work, but if it's possible to do what I want with CURL then I would also accept that as an answer.
Well, there are a couple of issues with what you're trying to do.
The double quotes around your URL prevent Bash brace expansion, so you're not really downloading 6 files, but a single URL with "{1..6}" in it. Remove the quotes around the URL to let bash expand it into 6 different arguments.
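You can see the effect of the quotes with a quick experiment in bash (plain echo, no network needed; the URL here is shortened for illustration):

```shell
# Inside double quotes, bash leaves the braces alone:
echo "http://example.com/fpart/{1..3}/vc/1"
# → http://example.com/fpart/{1..3}/vc/1

# Unquoted, bash expands the braces into three separate arguments:
echo http://example.com/fpart/{1..3}/vc/1
# → http://example.com/fpart/1/vc/1 http://example.com/fpart/2/vc/1 http://example.com/fpart/3/vc/1
```

The quoted form is exactly what wget received, which is why the server saw the literal "{1..6}" in the path.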
I notice that all of the pages are called "1", irrespective of their actual page numbers. This means the server is always serving a page with the same name, making it very hard for Wget or any other tool to actually make a copy of the webpage.
The real way to create a mirror of the forum would be to use this command line:
$ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1
Let me explain what this command does:
-m --mirror activates the mirror mode (recursion)
--no-parent asks Wget to not go above the directory it starts from
-k --convert-links will edit the HTML pages you download so that the links in them will point to the other local pages you have also downloaded. This allows you to browse the forum pages locally without needing to be online
--adjust-extension This is the option you were originally looking for. It will cause Wget to save the file with a .html extension if it downloads a text/html file but the server did not provide an extension.
Simply use the -O switch to specify the output filename; otherwise wget defaults to the last path component, which in your case is 1.
So if you wanted to call your file what-i-want-to-call-it.html, you would do:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1" -O what-i-want-to-call-it.html
(Note it's a capital -O: the lowercase -o option sets the log file instead, so with -o the page itself would still be saved as 1.)
If you type wget --help into the console, you will get a full list of all the options wget provides.
To verify it worked, print the file:
cat what-i-want-to-call-it.html

Downloading pdf files with wget. (characters after file extension?)

I'm trying to recursively download all .pdf files from a webpage.
The files URL have this format:
"http://example.com/fileexample.pdf?id=****"
I'm using these parameters:
wget -r -l1 -A.pdf http://example.com
wget is rejecting all the files when saving. Getting this error when using --debug:
Removing file due to recursive rejection criteria in recursive_retrieve()
I think that's happening because of this "?id=****" after the extension.
Did you try -A "*.pdf*"? According to the wget docs, that should work.
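The reason the wildcard matters: wget's accept patterns are shell-style globs matched against the whole filename, and a pattern without wildcards (like -A .pdf) is treated as a suffix, which "fileexample.pdf?id=1234" does not end in. A quick illustration of the matching, using a hypothetical filename and plain shell globbing:

```shell
# Same glob semantics wget applies to its -A patterns
case "fileexample.pdf?id=1234" in
  *.pdf)  echo "matched by -A .pdf" ;;
  *.pdf*) echo "matched by -A '*.pdf*'" ;;
  *)      echo "rejected" ;;
esac
# → matched by -A '*.pdf*'
```

So "*.pdf*" accepts the name even with the trailing query string, while ".pdf" alone rejects it.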

How to parse html with wget to download an artifact using pattern matching against Jenkins

I am trying to download an artifact from Jenkins where I need the latest build. If I curl jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build it brings me to the page that contains the artifact I need to download, which in my case is myCompany-1234.ipa
So by changing curl to wget with --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ it downloads the index.html file.
If I add --reject index.html, it stops index.html from downloading.
If I add the name of the artifact with a wildcard, like MyCompany-*.ipa, it downloads a 14k file named MyCompany-*.ipa, not the MyCompany-1234.ipa I was hoping for. Keep in mind the page I am requesting only has one MyCompany-1234.ipa, so there will never be multiple matches.
If I use a flag to pattern match, -A "*.ipa", like so: wget --auth-no-challenge -A "*.ipa" https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/, it still doesn't download the artifact.
It works if I input the exact URL, like so: wget --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/MyCompany-1234.ipa
The problem here is the .ipa is not always going to be 1234, tomorrow will be 1235, and so on. How can I either parse the html or use the wildcard correctly in wget to ensure I am always getting the latest?
Never mind: working with another engineer here at work, I came up with a super elegant solution parsing JSON.
Install Chrome and get the plugin JSONView
Call the Jenkins API in your Chrome browser using https://$domain/$job/lastSuccessfulBuild/api/json
This will print out the key-value pairs in the JSON. Note the key you need; for me it was number.
brew install jq
In a bash script, create a variable that will hold the dynamic value. This stores the build number in latest:
latest=$(curl --silent --show-error https://userIsMe:123myapikey321@jenkins.mycompany.com/job/build_ios/lastSuccessfulBuild/api/json | jq '.number')
Print it to screen if you would like:
echo $latest
Now with some string interpolation pass the variable for latest to your wget call:
wget --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa
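If installing jq isn't an option, a rough pure-bash fallback can pull the number out with parameter expansion. This is only a sketch: it assumes the key appears once and the response is compact JSON on one line, and the sample string below is a stand-in for the real curl output:

```shell
# Stand-in for: json=$(curl --silent .../lastSuccessfulBuild/api/json)
json='{"number":1234,"result":"SUCCESS"}'

latest=${json#*\"number\":}   # drop everything up to and including "number":
latest=${latest%%,*}          # drop everything after the first comma
latest=${latest%\}}           # drop a trailing } in case number was the last key
echo "$latest"
# → 1234
```

For anything beyond a flat, well-known response, a real JSON parser like jq is the safer choice.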
Hope this helps someone, as there is limited info out there that is clear and concise, especially given that wget has been around for an eternity.

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!
Do you really need to compress the file?
wget provides -N / --timestamping, which turns on time-stamping. Say your file is located at www.example.com/file.txt.
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I'd go with comparing the hash of the new file against the old one. What matters in that case: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive to compare the hashes? Is it better to store the old hash in a text file? Do any of these have an advantage over just overwriting the old file?
Only you know that; run some tests.
If you go the hash route, consider sha256 for hashing and xz (the LZMA2 algorithm) for compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz 2>/dev/null | sha256sum)"   # empty on the first run
if [[ "$newfilesum" != "$oldfilesum" ]]; then
    xz -f file.txt   # overwrite the old archive with the new compressed data
else
    rm file.txt      # unchanged; discard the download
fi
and that's it.
Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
Also, take into account that the web is evolving to expose more metadata about pages. A well-built site should include a file version and/or modification date (or a valid Expires header) among its response headers; this, among other things, is part of what makes the web scalable.
How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile, pointing to myfile-[date]. The next time the script runs, it can check whether the contents of whatever lastfile points to are the same as the newly downloaded file.
Don't know if this would work well, but it's what I could think of.
You can compare the new file with the last one using the sum command, which computes a checksum of the file. If both files have the same checksum, they are very, very likely to be identical. There's another command called md5 that computes an MD5 fingerprint, but the sum command is available on virtually all systems.
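A minimal sketch of that comparison, using the POSIX cksum command (the filenames here are hypothetical stand-ins for the archived copy and the fresh download):

```shell
# Create two files with identical content to simulate an unchanged download
printf 'same content\n' > old.txt
printf 'same content\n' > new.txt   # pretend this was just fetched with wget

# cksum on stdin prints "<crc> <bytes>", so equal content gives equal output
if [ "$(cksum < old.txt)" = "$(cksum < new.txt)" ]; then
  echo "unchanged - discard the download"
else
  echo "changed - keep and archive it"
fi
# → unchanged - discard the download
```

Reading via stdin (rather than passing the filename) keeps the file name out of cksum's output, so only the content is compared.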

How can I modify .xfdl files? (Update #1)

The .XFDL file extension identifies XFDL Formatted Document files. These belong to the XML-based document and template formatting standard. This format is exactly like the XML file format; however, it contains a level of encryption for use in secure communications.
I know how to view XFDL files using a file viewer I found here. I can also modify and save these files by doing File:Save/Save As. I'd like, however, to modify these files on the fly. Any suggestions? Is this even possible?
Update #1: I have now successfully decoded and unzipped a .xfdl into an XML file which I can then edit. Now, I am looking for a way to re-encode the modified XML file back into base64-gzip (using Ruby or the command line)
If the encoding is base64 then this is the solution I've stumbled upon on the web:
"Decoding XDFL files saved with 'encoding=base64'.
Files saved with:
application/vnd.xfdl;content-encoding="base64-gzip"
are simple base64-encoded gzip files. They can be easily restored to XML by first decoding and then unzipping them. This can be done as follows on Ubuntu:
sudo apt-get install uudeview
uudeview -i yourform.xfdl
gunzip -S "" < UNKNOWN.001 > yourform-unpacked.xfdl
The first command will install uudeview, a package that can decode base64, among others. You can skip this step once it is installed.
Assuming your form is saved as 'yourform.xfdl', the uudeview command will decode the contents as 'UNKNOWN.001', since the xfdl file doesn't contain a file name. The '-i' option makes uudeview uninteractive, remove that option for more control.
The last command gunzips the decoded file into a file named 'yourform-unpacked.xfdl'.
Another possible solution - here
The only answer I can think of right now is: read the manual for uudeview.
As much as I would like to help you, I am not an expert in this area, so you'll have to wait for someone more knowledgeable to come along and help you.
Meanwhile I can give you links to some documents that might help you:
UUDeview Home Page
Using the XFDL Engine
Getting started with the XFDL Engine
Sorry if this doesn't help you.
You don't have to leave Ruby to do this; you can use the Base64 module in Ruby to encode the document like this:
irb(main):005:0> require 'base64'
=> true
irb(main):007:0> Base64.encode64("Hello World")
=> "SGVsbG8gV29ybGQ=\n"
irb(main):008:0> Base64.decode64("SGVsbG8gV29ybGQ=\n")
=> "Hello World"
And you can call gzip/gunzip using Kernel#system:
system("gzip foo.something")
system("gunzip foo.something.gz")
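For the command-line direction, re-encoding is just the decode pipeline run in reverse: gzip the edited XML, base64 it, and put the MIME header line back on top. This is a sketch; the filenames are hypothetical, the sample XML is created here just for the demo, and the header line is taken from the format quoted above:

```shell
# Suppose this is the edited XML (created here so the example is self-contained)
printf '<xml>hello</xml>\n' > yourform-unpacked.xfdl

# Re-pack: XFDL MIME header line, then the base64 of the gzipped XML
printf 'application/vnd.xfdl;content-encoding="base64-gzip"\n' > yourform.xfdl
gzip -c yourform-unpacked.xfdl | base64 >> yourform.xfdl

# Round-trip check: strip the header line, decode, gunzip
tail -n +2 yourform.xfdl | base64 -d | gunzip
# → <xml>hello</xml>
```

Whether the real XFDL viewer accepts the header formatted exactly this way is an assumption worth verifying against a file it produced itself.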