wget syncing with changing remote HTTP files - shell

I want to ensure an authorative remote file is in sync with a local file, without necessarily re-downloading the entire file.
I did mistakenly use wget -c http://example.com/filename
If "filename" was appended to remotely, that works fine. But if filename is prepended to, e.g. "bar" is prepended to a file just containing "foo", the end downloaded result filename contents in my test were wrongly "foo\nfoo", instead of "bar\nfoo".
Can anyone else suggest a different efficient http downloading tool? Something that looks at server caching headers or etags?

I believe that wget -N is what you are looking for. It turns on timestamping and allows wget to compare the local file timestamp with the remote timestamp. Keep in mind that you might still encounter corruption if the local file timestamp cannot be trusted e.g. if your local clock is drifting too much.

You could very well use curl: http://linux.about.com/od/commands/l/blcmdl1_curl.htm]1

Related

Speeding up lftp mirroring with many directories

I am trying to mirror a public FTP to a local directory. When I use wget -m {url} then wget quite quickly skips lots of files that have been already downloaded (and no newer version exists), when I use lftp open -u user,pass {url}; mirror then lftp sends MDTM for every file before deciding whether to download the file or not. With 2 million+ files in 50 thousand+ directories this is very slow, besides I get error messages that MDTM of directories could not be obtained.
In the manual it says that using set sync-mode off will result in sending all requests at once, so that lftp doesn't wait for each response. When I do that, I get error messages from the server saying there are too many connections from my IP address.
I tried running wget first to download only the newer files, but this does not delete the files which were removed from the FTP server, so I follow up with lftp to remove the old files, however lftp still sends MDTM on each file, which means that there is no advantage to this approach.
If I use set ftp:use-mdtm off, then it seems that lftp just downloads all files again.
Could someone suggest the correct setting for lftp with large number of directories/files (specifically, so that it skips directories which were not updated, like wget seems to do)?
Use set ftp:use-mdtm off and mirror --ignore-time for the first invocation to avoid re-downloading all the files.
You can also try to upgrade lftp and/or use set ftp:use-mlsd on, in this case lftp will get precise file modification time from the MLSD command output (provided that the server supports the command).

How can I send an HTTPS request from a file?

Let's assume I have a file request.txt that looks like:
GET / HTTP/1.0
Some_header: value
text=blah
I tried:
cat request.txt | openssl -s_client -connect server.com:443
Unfortunately it didn't work and I need to manually copy & paste the file contents. How can I do it within a script?
cat is not ideally suited to download remote files, it's best used for files local to the file system running the script. To download a remote file you have other commands that you can use which handle this better.
If your environment has wget installed you can download the file by URL. Here is a link for some examples on how it's used. That would look like:
wget https://server.com/request.txt
If your environment has curl installed you can download the file by URL. Here is a link for some examples on how it's used. That would look like:
curl -O https://server.com/request.txt
Please note that if you want to store the response in a variable for further modification you can do this as well with a bit more work.
Also worth noting is that if you really must use cat to download a remote file it's possible, but it may require ssh to be used and I'm not a fan of using that method as it requires access to a file via ssh where it's already publicly available over HTTP/S. There isn't a practical reason I can think of to go about it this way, but for the sake of completion I wanted to mention that it could be done but probably shouldn't.

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tale of other options I've read about.
I wish to access a database file hosted as follows (i.e. the hhsuite_dbs is a folder containing several databases)
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, and so I want to download the lastest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question is then, how do I go about retrieving this (and for a little more finesse I'd perhaps like to be able to check whether the remote file has changed in name or content to avoid a large download if unnecessary)? Is the best approach to query the name of the file, or the file property of date last modified (given that they may change the naming syntax of the file too?). To my naive brain, some kind of globbing of the pdb70 (something I think I can rely on to be in the filename) then pulled down with wget was all I had come up with so far.
EDIT Another confounding issue that has just occurred to me is that the file I want wont necessarily be the newest in the folder (as there are other types of databases there too), but rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp, curlftpls but all of these seem to suggest logins/passwords for the server which I don't have/need if I was to just download it via the web. I've also seen mention of rsync, but of a cursory read it seems like people are steering clear of it for FTP uses.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. You're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
echo "Difference since last check:"
diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. Nice thing about --mirror is that it won't download files that are already on-hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

MIME type check via SFTP connection

I want to list images by SFTP and save this list, so another script may further process it.
Unfortunately, there are also many other files there, so I need to identify which are images. I am filtering out everything with wrong file extension, but I would like to go a step further and check also the content of the file.
Downloading everything to check it with file --mime-type on local machine is too slow. Is there a way how to check MIME type of a file on remote SFTP before the download?
We found a way, downloading only first 64 bytes. It is a lot faster than downloading whole file, but still enough to see if it looks like an image:
curl "sftp://sftp.example.com/path/to/file.png" -u login:pass -o img.tmp -r 0-64
file --mime-type img.tmp
MIME type is supported by SFTP version 6 and newer only.
Most SFTP clients and servers, including the most widespread one, OpenSSH, support SFTP version 3 only.
Even the servers that I know of to support SFTP version 6, like Bitvise or ProFTPD mod_sftp, do not support the "MIME type" attribute.
So while in theory it's possible to determine MIME type of remote files over SFTP, in practice, you won't be able to do it.
You can run any command remotely using ssh:
ssh <destination> '<command_to_run>'
In this case that would be something like:
ssh <remote_machine_name_or_ip_address> 'file --mime-type ./*'

How to download all files from hidden directory

I have do download all log files from a virtual directory within a site. The access to virtual directory is forbidden but files are accessible.
I have manually entered the file names to download
dir="Mar"
for ((i=1;i<100;i++)); do
wget http://sz.dsyn.com/2014/$dir/log_$i.txt
done
The problem is the script is not generic and most of the time I need to find out how many files are there and tweak the for loop. Is there a way to trigger wget to fetch all files without me bothering to specify the exact count.
Note:
If I use the browser to view http://sz.dsyn.com/2014/$dir, it is 403 forbidden. I cant pull all the files via browser tool/extension.
First of all check this similar question If this is not what you are looking for, you need to generate a file of URLs within and feed wget. e.g.
wget --input-file=http://sz.dsyn.com/2014/$dir/filelist.txt
wget will have the same problem your browser has: it cannot read the directory. Just pull until your first failure then quit.

Resources