Speeding up lftp mirroring with many directories - ftp

I am trying to mirror a public FTP server to a local directory. When I use wget -m {url}, wget quite quickly skips files that have already been downloaded (and for which no newer version exists). When I use lftp open -u user,pass {url}; mirror, lftp sends MDTM for every file before deciding whether to download it. With 2 million+ files in 50 thousand+ directories this is very slow; besides, I get error messages that the MDTM of directories could not be obtained.
In the manual it says that using set sync-mode off will result in sending all requests at once, so that lftp doesn't wait for each response. When I do that, I get error messages from the server saying there are too many connections from my IP address.
I tried running wget first to download only the newer files, but this does not delete files which were removed from the FTP server, so I follow up with lftp to remove the old ones; however, lftp still sends MDTM for each file, which means there is no advantage to this approach.
If I use set ftp:use-mdtm off, then it seems that lftp just downloads all files again.
Could someone suggest the correct setting for lftp with large number of directories/files (specifically, so that it skips directories which were not updated, like wget seems to do)?

Use set ftp:use-mdtm off and mirror --ignore-time for the first invocation to avoid re-downloading all the files.
You can also try upgrading lftp and/or using set ftp:use-mlsd on; in that case lftp will get precise file modification times from the MLSD command output (provided that the server supports the command).
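Put together, a first run might look like the sketch below (hypothetical host and paths; drop --ignore-time on later runs, and note that --delete, which removes local files no longer present on the server, is an extra assumption beyond the answer above):
# first run: rely on directory listings instead of per-file MDTM requests
lftp -c "set ftp:use-mdtm off; set ftp:use-mlsd on; open -u user,pass ftp.example.com; mirror --ignore-time --delete /remote/path /local/path"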

Related

Bash script: get folder size through FTP

I searched a lot and googled a lot too but I didn't find anything...
I'm using bash on linux...
I have to download a certain folder from an FTP server (I know FTP is deprecated, but I can't use FTPS or SFTP right now; it's a local network anyway).
I want to do a sort of integrity check of the downloaded folder, which has a lot of subfolders and files, so I chose to compare folder sizes as a test.
I'm downloading through wget, but my question is: how can I check the folder size BEFORE downloading it, so that I can store the size in a file and then compare it with the downloaded one? Over FTP, that is...
I tried a simple curl to the parent directory, but there is no size information there...
Thanks for the help!!
wget will recursively download all the directory content, one file at a time. There's no way to determine the size of an entire directory using plain FTP, though it is possible using SSH.
I recommend installing an SSH server on the server; after gaining access, you can use the following command to get the size of the desired directory:
du -h desired_directory | tail -n 1
I do not recommend this method, though; it is far more reliable to get checksums of the remote content and compare them against your downloaded copy, which is also what many download clients already do to verify file integrity.
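A minimal sketch of the checksum approach, assuming you have SSH access to the server and using the hypothetical paths below:
# compute checksums on the server, relative to the directory being mirrored
ssh user@server 'cd /srv/ftp/desired_directory && find . -type f -exec md5sum {} +' > remote.md5
# verify the downloaded copy against the remote checksums
cd local_copy && md5sum -c ../remote.md5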
It basically depends on what your FTP client and FTP server can do. With some that I know, the default ls does the job, and they even have a size command:
ftp> help size
size show size of remote file
ftp> size foo
213 305
ftp> ls foo
200 PORT command successful.
150 Opening ASCII mode data connection for file list
-rw-r--r-- 1 foo bar 305 Aug 1 2013 foo
226 Transfer complete.
You can't check a folder's size via the FTP protocol itself. Try mounting the remote folder from the main server via curlftpfs (if that is possible) and checking the size there.
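A rough sketch of that idea, with a hypothetical host and mount point (requires curlftpfs/FUSE to be installed):
mkdir -p /mnt/ftp
curlftpfs ftp://user:pass@ftp.example.com /mnt/ftp
du -sh /mnt/ftp/desired_directory   # total size of the remote folder
fusermount -u /mnt/ftp              # unmount when done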

Correct LFTP command to upload only updated files

I am using codeship.io to upload files in a code repository to a shared hosting without SSH.
This is the original command; it took two hours to complete:
lftp -c "open -u $FTP_USER,$FTP_PASSWORD ftp.mydomain.com; set ssl:verify-certificate no; mirror -R ${HOME}/clone/ /public_html/targetfolder"
I tried to add -n, which is supposed to upload only newer files. But I can still see from the streaming logs that some unchanged files are being uploaded:
lftp -c "open -u $FTP_USER,$FTP_PASSWORD ftp.mydomain.com; set ssl:verify-certificate no; mirror -R -n ${HOME}/clone/ /public_html/targetfolder"
What is the correct command to upload only updated files?
The command is correct.
The question is why lftp considers the files "changed". It uploads a file if it is missing, has a different size, or has a different modification time.
You can try running "ls" in the directory where lftp uploads the files and check whether the files are really present, have the same size, and have the same or a newer modification time.
If for some reason the modification time is older, add --ignore-time to the mirror command.
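Building on the command from the question (same credentials and paths), that would become something like:
lftp -c "open -u $FTP_USER,$FTP_PASSWORD ftp.mydomain.com; set ssl:verify-certificate no; mirror -R -n --ignore-time ${HOME}/clone/ /public_html/targetfolder"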
Codeship builds the code first before deployment.
This means that the code on Codeship's temporary build server is newer than anything else in your pipeline, even though the code itself may not have changed.
This is why, when you use lftp's "only newer files" option, it effectively matches everything.
As far as I know, you can't upload only the actual newer files.

MIME type check via SFTP connection

I want to list images over SFTP and save the list so that another script can process it further.
Unfortunately, there are also many other files there, so I need to identify which ones are images. I am filtering out everything with the wrong file extension, but I would like to go a step further and also check the content of the files.
Downloading everything to check it with file --mime-type on the local machine is too slow. Is there a way to check the MIME type of a file on the remote SFTP server before downloading it?
We found a way: download only the first 64 bytes. It is a lot faster than downloading the whole file, but still enough to tell whether the file looks like an image:
curl "sftp://sftp.example.com/path/to/file.png" -u login:pass -o img.tmp -r 0-64
file --mime-type img.tmp
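To apply this to a whole listing, a rough sketch, assuming a hypothetical files.txt with one remote path per line and the same host/credentials as above:
while read -r path; do
  curl -s "sftp://sftp.example.com/$path" -u login:pass -o head.tmp -r 0-64
  case "$(file --brief --mime-type head.tmp)" in
    image/*) echo "$path" >> images.txt ;;   # keep only files whose magic bytes say "image"
  esac
done < files.txt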
MIME type is supported by SFTP version 6 and newer only.
Most SFTP clients and servers, including the most widespread one, OpenSSH, support SFTP version 3 only.
Even the servers that I know of to support SFTP version 6, like Bitvise or ProFTPD mod_sftp, do not support the "MIME type" attribute.
So while in theory it's possible to determine MIME type of remote files over SFTP, in practice, you won't be able to do it.
You can run any command remotely using ssh:
ssh <destination> '<command_to_run>'
In this case that would be something like:
ssh <remote_machine_name_or_ip_address> 'file --mime-type ./*'
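If you do have shell access, you could also let the server do the filtering and just save the list, e.g. (hypothetical host and path):
ssh user@sftp.example.com "file --mime-type /path/to/files/*" | grep ': image/' > images.txt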

wget syncing with changing remote HTTP files

I want to ensure an authorative remote file is in sync with a local file, without necessarily re-downloading the entire file.
I did mistakenly use wget -c http://example.com/filename
If "filename" was appended to remotely, that works fine. But if filename is prepended to, e.g. "bar" is prepended to a file just containing "foo", the end downloaded result filename contents in my test were wrongly "foo\nfoo", instead of "bar\nfoo".
Can anyone else suggest a different efficient http downloading tool? Something that looks at server caching headers or etags?
I believe that wget -N is what you are looking for. It turns on timestamping and allows wget to compare the local file timestamp with the remote timestamp. Keep in mind that you might still encounter corruption if the local file timestamp cannot be trusted, e.g. if your local clock is drifting too much.
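With the URL from the question, that would be:
wget -N http://example.com/filename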
You could very well use curl: http://linux.about.com/od/commands/l/blcmdl1_curl.htm
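For example, curl can make the request conditional on the local file's timestamp with -z/--time-cond, and recent releases can also compare ETags with --etag-save/--etag-compare; a sketch using the question's URL:
# re-download only if the server reports a newer Last-Modified than the local copy
curl -z filename -R -o filename http://example.com/filename
# with a recent curl, re-download only if the ETag has changed
curl --etag-compare etag.txt --etag-save etag.txt -o filename http://example.com/filename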

How to resume an ftp download at any point? (shell script, wget option)?

I want to download a huge file from an ftp server in chunks of 50-100MB each. At each point, I want to be able to set the "starting" point and the length of the chunk I want. I won't have the "previous" chunks saved locally (i.e. I can't ask the program to "resume" the download).
What is the best way of going about that? I use wget mostly, but would something else be better?
I'm really interested in a pre-built/built-in option rather than using a library for this purpose... Since wget/ftp (also, I think) allow resumption of downloads, I don't see why that would be a problem... (I can't figure it out from all the options, though!)
I don't want to keep the entire huge file at my end, just process it in chunks... FYI all, I'm having a look at "continue FTP download after reconnect", which seems interesting...
Use wget with the -c option.
Extracted from man pages:
-c / --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program. For instance:
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named ls-lR.Z in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
For those who'd like to use command-line curl, here goes:
curl -u user:passwd -C - -o <partial_downloaded_file> ftp://<ftp_path>
(leave out -u user:pass for anonymous access)
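If you need explicit chunks at a chosen offset rather than resuming from whatever is already on disk, curl's -r/--range option also works over FTP; for example, a 50 MiB chunk starting at byte 104857600, from a hypothetical server and path:
curl -u user:passwd -r 104857600-157286399 -o chunk3.part ftp://ftp.example.com/path/to/hugefile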
I'd recommend interfacing with libcurl from the language of your choice.
