LFTP to touch every file it's downloading - bash

I'm using lftp to run a backup job, from one location to another. It may not be the prettiest solution, but it works really well. I'm using this command:
/usr/bin/lftp -c "open -p 9922 -u jdoe,passw0rd sftp://sftpsiteurl.com; mirror -c -e -R -L /source-folder /destination-folder/"
But I need to change the created date or modified date on the files coming down. Right now the date on the files is the one from the remote location, and I'm not sure how to change this.
I can see that you can run some kind of script for validating the files coming down, but I'm unsure of the command and I can't seem to find any examples.
Does anybody know if this is possible, and how to do it?
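One workaround (a sketch, not a built-in lftp option) is to run the mirror exactly as above and then reset the timestamps on the local copies with touch, assuming /destination-folder/ is the local side of the transfer:
#!/bin/bash
# mirror command taken verbatim from above
/usr/bin/lftp -c "open -p 9922 -u jdoe,passw0rd sftp://sftpsiteurl.com; mirror -c -e -R -L /source-folder /destination-folder/"
# set the access and modification times of every downloaded file to "now"
find /destination-folder/ -type f -exec touch {} +
Plain touch stamps the files with the current time; touch -d can be used instead if you need a specific date.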

Related

wget - how to download all files that only include "480p" using wget from http server?

I want to download all files from a http server like:
http://dl.mysite.com/files/
and I also want to go inside each folder inside that folder.
But I do want to download only those files that have "480p" in their name.
What is the easiest solution for that using wget?
edit:
I want that script to run each night between 2am and 6am to sync those files from that server to my PC.
The following wget command should do the job:
wget -A "*480p*" -r -np -nc --no-check-certificate -e robots=off http://dl.mysite.com/files/
Explanation:
-A "480p" your pattern
-r, recursively recursively look through the folders
-np, --no-parent ignore links to a higher directory
-nc, --no-clobber If a file is downloaded more than once in the same directory, Wget’s behavior depends on a few options, including ‘-nc’. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
--no-check-certificate Don’t check the server certificate against the available certificate authorities.
-e, --execute command A command thus invoked will be executed after the commands in .wgetrc
robots=off robot exclusion
More information on wget flags can be found at the official GNU manual page: https://www.gnu.org/software/wget/manual/wget.html
With regards to it being run once per day, you may want to read up on Cron jobs. Taken from the documentation page at: https://help.ubuntu.com/community/CronHowto
A crontab file is a simple text file containing a list of commands meant to be run at specified times. It is edited using the crontab command. The commands in the crontab file (and their run times) are checked by the cron daemon, which executes them in the system background.
So basically you need to put your wget command into a file, and set the cron to run this file at the specified time.
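For example, a crontab entry along these lines (the script path /home/user/sync-480p.sh is only a placeholder) starts the sync every night at 2am:
# edit the table with: crontab -e
# minute hour day-of-month month day-of-week command
0 2 * * * /home/user/sync-480p.sh
Note that cron only controls when the job starts; if it must be finished by 6am, you would have to enforce that yourself, e.g. by wrapping the script in timeout 4h.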
Note: Windows does not have a native implementation of Cron, but you can achieve the same effect using the Windows Task Scheduler.

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tail of the other options I've read about.
I wish to access a database file hosted as follows (i.e. hhsuite_dbs is a folder containing several databases):
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, so I want to download the latest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question, then, is how do I go about retrieving this? For a little more finesse, I'd also like to check whether the remote file has changed in name or content, so I can avoid a large download when it isn't necessary. Is the best approach to query the name of the file, or its last-modified date (given that they may also change the naming syntax of the file)? To my naive brain, some kind of globbing on "pdb70" (something I think I can rely on being in the filename), then pulling it down with wget, was all I had come up with so far.
EDIT Another confounding issue that has just occurred to me is that the file I want won't necessarily be the newest in the folder (there are other types of databases there too); rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp and curlftpls, but all of these seem to require logins/passwords for the server, which I don't have/need if I just download it via the web. I've also seen mention of rsync, but on a cursory read it seems like people are steering clear of it for FTP use.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. What you're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
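If you want something a bit more robust than parsing ls output, a find-based variant (an untested sketch, assuming GNU find; it still assumes filenames without newlines) would be:
# print "mtime path" for each pdb70 archive, sort by mtime, keep the newest
newestfile=$(find /some/place/safe -maxdepth 1 -name 'pdb70*gz' -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)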
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
    echo "Difference since last check:"
    diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
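For instance, files that are new since the last run show up with a leading ">" in plain diff output, so a minimal (untested) way to pick out freshly added pdb70 archives, placed in the script before the final mv, would be:
# lines starting with "> " are entries present only in the new listing
diff /tmp/filelist.txt /tmp/filelist.txt.$$ | sed -n 's/^> //p' | grep '^pdb70.*gz$'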
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. The nice thing about --mirror is that it won't download files that are already on hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

bash script wget download files by date

I'm new to the world of bash scripting. Hoping to seek some help here.
Been messing about with the 'wget' command and found that it is quite neat! At the moment, it gets all contents from a https site, including all directories, and saves them all accordingly. Here's the command:
wget -r -nH --cut-dirs=1 -R index.html -P /home/snoiniM/data/in/ https://www.someWebSite.com/folder/level2 --user=someUserName --password=P#ssword
/home/snoiniM/data/in/folder/level2/level2-2013-07-01.zip saved
/home/snoiniM/data/in/folder/level2/level2-2013-07-02.zip saved
/home/snoiniM/data/in/folder/level2/level2-2013-07-03.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-01.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-02.zip saved
/home/snoiniM/data/in/folder/level3/level3-2013-07-03.zip saved
That is fine for all intents and purposes. But what if I really just want to get a specific date from all its directories? E.g. just levelx-2013-07-03.zip from all dirs within folder, and save them all to one directory locally (e.g. all *.zip will be in ...folder/).
Does anyone know how to do this?
I found that dropping --cut-dirs=1 and using the URL www.someWebsite.com/folder/ is sufficient.
Also, with that in mind, I added the -nd option. This means no directories: "Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering."
That leaves one more part: how do I write a bash script that gets yesterday's date and passes it to the wget command as a parameter?
E.g.
wget -r -nH -nd -R index.html -A "*$yesterday.zip" -P /home/snoiniM/data/in/ https://www.someWebSite.com/folder/ --user=someUserName --password=P#ssword
Just the snippet you are looking for:
yesterday=$(date --date="@$(($(date +%s)-86400))" +%Y-%m-%d)
And there is no need for the * before $yesterday; -A also accepts a plain suffix.
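Putting the two together, a full script might look like this (an untested sketch; the URL, credentials and paths are the placeholders from the question):
#!/bin/bash
# yesterday's date in YYYY-MM-DD (GNU date)
yesterday=$(date --date="@$(($(date +%s)-86400))" +%Y-%m-%d)
# fetch only the archives for that date, flattened into one local directory
wget -r -nH -nd -R index.html -A "$yesterday.zip" -P /home/snoiniM/data/in/ \
     --user=someUserName --password='P#ssword' https://www.someWebSite.com/folder/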

Why does lftp mirror --only-newer not transfer "only newer" files?

I want to automate uploading the files for my websites. However, the remote server does not support ssh, so I am trying the lftp command below instead of rsync.
lftp -c "set ftp:use-mdtm no && set ftp:timezone -9 && open -u user,password ftp.example.com && mirror -Ren local_directory remote_directory"
If no local files have changed, no files are uploaded by this command. But when I change a single file and run the command, all files are uploaded.
I know about lftp/ftp's MDTM problem, so I tried "set ftp:use-mdtm no && set ftp:timezone -9", but all files are still uploaded even though I changed only one file.
Does anyone know why lftp mirror --only-newer does not transfer only the newer files?
On the following page
http://www.bouthors.fr/wiki/doku.php?id=en:linux:synchro_lftp
the authors state:
When uploading, it is not possible to set the date/time on the files uploaded, that's why --ignore-time is needed.
So if you use the flag combination --only-newer and --ignore-time, you can achieve decent backup properties, in the sense that every file that differs in size gets replaced. Of course it doesn't help if you really need to rely on time synchronization, but if it is just to perform a regular backup of data, it'll do the job.
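Applied to the command from the question, that combination would look something like this (untested; the credentials and directories are the placeholders already used above, and you can add -e back if you also want deletions mirrored):
lftp -c "set ftp:use-mdtm no && set ftp:timezone -9 && open -u user,password ftp.example.com && mirror -R --only-newer --ignore-time local_directory remote_directory"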
mirror -R -n works for me as a very simple backup of new files

How to resume an ftp download at any point? (shell script, wget option)?

I want to download a huge file from an ftp server in chunks of 50-100MB each. At each point, I want to be able to set the "starting" point and the length of the chunk I want. I won't have the "previous" chunks saved locally (i.e. I can't ask the program to "resume" the download).
What is the best way of going about that? I use wget mostly, but would something else be better?
I'm really interested in a pre-built/built-in function rather than using a library for this purpose... Since wget/ftp (also, I think) allow resumption of downloads, I don't see why that would be a problem... (I can't figure it out from all the options though!)
I don't want to keep the entire huge file at my end, just process it in chunks... FYI all - I'm having a look at "continue FTP download after reconnect", which seems interesting.
Use wget with the -c option.
Extracted from man pages:
-c / --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program. For instance:
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named ls-lR.Z in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
For those who'd like to use command-line curl, here goes:
curl -u user:passwd -C - -o <partial_downloaded_file> ftp://<ftp_path>
(leave out -u user:pass for anonymous access)
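If, as in the question, you want to fetch an explicit byte range rather than resume from the end of a partial file, curl also has a -r/--range option (the offsets below are just an example second 50 MiB chunk):
# bytes 52428800-104857599 = the second 50 MiB chunk of the file
curl -u user:passwd -r 52428800-104857599 -o chunk2 ftp://<ftp_path>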
I'd recommend interfacing with libcurl from the language of your choice.

Resources