How to upload a 50 GB file through SFTP efficiently? - hadoop

I want to implement an SFTP client using the JSch Java library. I have the questions below; please suggest your ideas.
What is the best way to upload very big files (around 50 GB) to an SFTP server?
During such an upload there is a high chance of getting a "session timeout" error. Is there a better way to solve it than setting the timeout explicitly?

Transferring a very large file in one go is not always the best idea.
One option is to use a bigger buffer, if the upload/download bandwidth is not already used to the maximum.
But my suggested option is to split the file up automatically, or to use a temp-file mechanism that can handle interruptions during the transfer (as, for example, JDownloader does).
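A minimal sketch of the "handle interruptions" idea, in Python rather than Java: find out how many bytes the server already holds, seek to that offset in the local file, and append the rest in fixed-size chunks. The function name, the chunk size and the in-memory streams standing in for the local file and the remote handle are all assumptions for illustration. As far as I know, JSch's ChannelSftp.put offers a RESUME mode that serves the same purpose.

```python
import io

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size (1 MiB); tune for your link


def resume_upload(src, dst, already_sent, chunk_size=CHUNK_SIZE):
    """Copy src to dst starting at byte offset already_sent.

    src must be seekable; dst must already be positioned for appending.
    Returns the total number of bytes at the destination afterwards.
    """
    src.seek(already_sent)
    total = already_sent
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Simulate an upload that was interrupted after 5 bytes.
source = io.BytesIO(b"0123456789")
remote = io.BytesIO()
remote.write(b"01234")  # bytes the "server" already holds
final_size = resume_upload(source, remote, already_sent=5, chunk_size=4)
```

Only one chunk is held in memory at a time, so the same loop works whether the file is 50 MB or 50 GB.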

Related

How to only read a few lines from a remote file?

Before downloading a file, I need to set up how it (typically a .csv, but not always) will be parsed.
I don't want to download the whole file, especially if the "headers" do not match what is expected.
Is there a way to download only up to a certain number of bytes and then gracefully kill the connection?

There's no explicit support for this in the FTP protocol.
There's an expired draft for a RANG command that would allow this:
https://datatracker.ietf.org/doc/html/draft-bryan-ftp-range-08
But that would obviously be supported only by newer FTP servers.
Still, nothing prevents you from initiating a normal (full) download and forcefully breaking it off as soon as you have the amount of data you need.
All you need to do is close the data transfer connection. This is basically what all FTP clients do when an end user decides to abort a transfer.
This approach might result in a few error messages in the FTP server log.
If you can use the SFTP protocol, then it's easy: SFTP supports this natively.
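A rough sketch of the "read a little, then stop" approach, assuming the remote file is exposed as a file-like object (many SFTP libraries hand you one; with plain FTP you would read from the data connection and then close it). The function names and the 4096-byte limit are hypothetical, and the sketch assumes the header row fits within that limit. An in-memory stream stands in for the remote file.

```python
import csv
import io


def peek_header(stream, max_bytes=4096):
    """Read at most max_bytes from stream and return the first CSV row."""
    head = stream.read(max_bytes)
    text = head.decode("utf-8", errors="replace")
    first_line = text.splitlines()[0] if text else ""
    return next(csv.reader([first_line]), [])


def headers_match(stream, expected, max_bytes=4096):
    """Check the header row, then close the stream without reading the rest."""
    try:
        return peek_header(stream, max_bytes) == expected
    finally:
        stream.close()


# Stand-in for a remote .csv file.
remote = io.BytesIO(b"id,name,amount\n1,alice,10\n2,bob,20\n")
ok = headers_match(remote, ["id", "name", "amount"], max_bytes=32)

other = io.BytesIO(b"foo,bar\n1,2\n")
bad = headers_match(other, ["id", "name", "amount"], max_bytes=32)
```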

How Would I Serve Thousands of Files per Request

I am working on an application where the user has the potential to download thousands of files in one request into a zip file. Obviously, this will not be practical for our server. What would be the best way to go about serving up thousands of files to users?
Right now, what I have been working on is to have the jQuery fileDownload library request 100 files, then in the success handler call fileDownload again for the next 100 files (offset by 100). The problem with this is that the fileDownload library (or the server) waits about 20 seconds until the fileDownload fail callback is called.
The other problem with this method is that it isn't practical for the client to receive hundreds of pop-up windows asking whether they want to download 100 files.
We also won't be able to send back thousands of files in one response, because our server doesn't and won't have that much memory.

This is purely opinion based on my experience, but here are two options I have seen in use:
Option 1:
Batch process the files, compress them, then advise the user of the download location. The number and size of files should be limited, though, as this can burn through the server's resources. I don't recommend this if you have a large number of users.
Option 2 (Best):
Batch process the files into a compressed file, then either enable users to FTP into the location to obtain the files or, if your users have their own FTP location, have the file transferred over to it. I can tell you this is definitely the most effective approach, and it is used by a number of corporations I have been involved with.
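The batching step in both options can be done without holding the files in memory. A minimal sketch, assuming the files already exist on the server's disk: write the archive to disk one entry at a time, then hand the user the archive's location. The function name and the temporary directory layout are hypothetical.

```python
import os
import tempfile
import zipfile


def build_archive(paths, archive_path):
    """Write files into a zip on disk one at a time; return the entry count."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # ZipFile.write reads the source file from disk in small chunks.
            zf.write(path, arcname=os.path.basename(path))
    with zipfile.ZipFile(archive_path) as zf:
        return len(zf.namelist())


# Create a few sample files to stand in for the user's selection.
workdir = tempfile.mkdtemp()
sources = []
for i in range(3):
    p = os.path.join(workdir, f"file{i}.txt")
    with open(p, "w") as fh:
        fh.write(f"payload {i}\n")
    sources.append(p)

archive = os.path.join(workdir, "batch.zip")
count = build_archive(sources, archive)
```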

Transfer a big file in golang

A client sends a file, which may be larger than 5 GB, to a slave server, and then the slave sends it to the master server.
Will the slave save a temp file locally? I do not want that to happen, because it will slow down the upload and waste the slave's memory.
Is there any way to avoid this? And what is the best way to transfer a big file in Go?

Yes, there's a standard way to avoid the store-and-forward approach: as soon as a client connects to the slave server, the latter should open a connection to the master server and then just stream the data from the client there. Typically this is done using the io.Copy() function. Thanks to Go's excellent duck typing using interfaces, this works for TCP connections and HTTP requests/responses alike.
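In Go the relay boils down to a call like io.Copy(masterConn, clientConn), where both names are hypothetical connections. The pattern itself is language-independent; below is a small Python sketch of the same idea using shutil.copyfileobj, which likewise moves data through a fixed-size buffer instead of loading the whole payload. In-memory streams stand in for the two connections.

```python
import io
import shutil


def relay(client_stream, master_stream, buffer_size=64 * 1024):
    """Forward bytes from client to master using a fixed-size buffer."""
    shutil.copyfileobj(client_stream, master_stream, buffer_size)


# Stand-ins for the incoming client connection and the outgoing master connection.
incoming = io.BytesIO(b"x" * 200_000)
outgoing = io.BytesIO()
relay(incoming, outgoing)
```

Memory use on the relay stays bounded by the buffer size regardless of how large the file is, and nothing is written to the relay's disk.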
(To get a better explanation, you have to narrow your question down.)
Part of the solution even appears in the similar questions suggested by Stack Overflow.

Scripting a major multi-file multi-server FTP upload: is smart interrupted transfer resuming possible?

I'm trying to upload several hundred files to 10+ different servers. I previously accomplished this using FileZilla, but I'm trying to make it work using just common command-line tools and shell scripts, so that it isn't dependent on working from a particular host.
Right now I have a shell script that takes a list of servers (in ftp://user:pass@host.com format) and spawns a new background instance of 'ftp ftp://user:pass@host.com < batch.file' for each server.
This works in principle, but as soon as the connection to a given server times out, resets or gets interrupted, it breaks. While all the other transfers keep going, I have no way of resuming whichever transfers have been interrupted. The only way to know whether this has happened is to check each receiving server by hand. This sucks!
Right now I'm looking at wput and lftp, but these would require installation on whichever host I want to run the upload from. Any suggestions on how to accomplish this in a simpler way?

I would recommend using rsync. It's really good at transferring only the data that has changed, which makes it much more efficient than FTP. More info on how to resume interrupted connections with an example can be found here. Hope that helps!
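A hedged sketch of how an rsync-based upload could be retried automatically, assuming rsync is installed on the sending host and the receiving servers accept it (typically over SSH). The host name, paths and retry count are hypothetical. The runner is injectable so the example can be exercised without a network; by default it calls the real rsync binary.

```python
import subprocess
import time


def rsync_command(source, destination):
    # -a preserves attributes, -z compresses in transit,
    # --partial keeps partially transferred files so a rerun can continue.
    return ["rsync", "-az", "--partial", source, destination]


def sync_with_retry(source, destination, attempts=5, delay=0.0, run=None):
    """Rerun rsync until it exits with 0; return the attempt number or None."""
    run = run or (lambda cmd: subprocess.run(cmd).returncode)
    for attempt in range(1, attempts + 1):
        if run(rsync_command(source, destination)) == 0:
            return attempt
        time.sleep(delay)
    return None


# Fake runner: two failed attempts (nonzero exit codes), then success.
results = iter([30, 12, 0])
used = sync_with_retry(
    "./site/", "user@host.example:/var/www/", run=lambda cmd: next(results)
)
```

Running one such loop per server, and logging the servers for which it returns None, replaces checking each receiving server by hand.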

Best approach to collecting log files from remote machines?

I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk of each blade. Each server is running Windows Server 2003, and the files are on a share which can be accessed remotely as \\server\Logs. Each machine holds many files, which can be several MB each, and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?

I guess it depends on what you do with them. If you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines, so that it parses and loads into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)

The first improvement that comes to mind is to not ship entire log files, but only the records written since the last shipment. This of course assumes that the files accumulate over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that drops the older records and outputs the remainder would be sufficient. If no such discriminator is available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to ship only new content. In our own system, logs are shipped via a service that replicates the logs as they are written. That required writing a small service to handle the log files, but it reduced the latency in capturing logs and cut bandwidth use immensely.
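The "keep track of the last byte sent" variant could look roughly like this. It is a sketch under assumptions: the offset is kept in a small state file next to the log (a hypothetical layout), and a log that has shrunk is treated as rotated and read from the start. A real shipper would persist the new offset only after the central server confirmed receipt.

```python
import os
import tempfile


def read_new_bytes(log_path, state_path):
    """Return bytes appended since the last call and persist the new offset."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            offset = int(fh.read().strip() or 0)
    if offset > os.path.getsize(log_path):
        offset = 0  # the log was truncated or rotated; start over
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    with open(state_path, "w") as fh:
        fh.write(str(offset + len(data)))
    return data


workdir = tempfile.mkdtemp()
log = os.path.join(workdir, "app.log")
state = os.path.join(workdir, "app.log.offset")

with open(log, "ab") as fh:
    fh.write(b"line1\n")
first = read_new_bytes(log, state)

with open(log, "ab") as fh:
    fh.write(b"line2\n")
second = read_new_bytes(log, state)

third = read_new_bytes(log, state)  # nothing new since the last call
```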

Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the central server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB, and the Windows ftp command can be automated with "-s:scriptfile")
notify you when it cannot push its logs for any reason
do all of the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant as an offset in minutes from midnight?
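The octet-based stagger suggested above can be sketched in a few lines. The constant of 5 minutes per step is an arbitrary assumption, and the result is wrapped at 24 hours so that larger constants still yield a valid time of day.

```python
def stagger_minutes(ip_address, minutes_per_step=5):
    """Offset from midnight, in minutes, derived from the last IPv4 octet."""
    last_octet = int(ip_address.rsplit(".", 1)[1])
    return (last_octet * minutes_per_step) % (24 * 60)


def start_time(ip_address, minutes_per_step=5):
    """Format the offset as an HH:MM start time."""
    hours, minutes = divmod(stagger_minutes(ip_address, minutes_per_step), 60)
    return f"{hours:02d}:{minutes:02d}"


early = start_time("10.0.0.37")   # 37 * 5 = 185 minutes after midnight
late = start_time("10.0.0.255")   # 255 * 5 = 1275 minutes after midnight
```

Machines on different subnets that share a last octet would still collide, so this spreads the load rather than guaranteeing unique slots.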
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
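For the "same log file twice" item, one possible policy is to ignore exact duplicates by content hash. The class and method names below are hypothetical, and the in-memory set is a stand-in for whatever persistent store the central server would really use.

```python
import hashlib


class Inbox:
    """Queue incoming log files, skipping payloads that were already received."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def accept(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.seen:
            return False  # duplicate upload: ignore instead of reprocessing
        self.seen.add(digest)
        self.queue.append(name)
        return True


inbox = Inbox()
first = inbox.accept("web01_2010-05-01.log.zip", b"log contents")
again = inbox.accept("web01_2010-05-01.log.zip", b"log contents")
other = inbox.accept("web02_2010-05-01.log.zip", b"different contents")
```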

We have a similar product on a smaller scale here. Our solution is to have the machines that generate the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read-write times that kept a server busy for days.

It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?

I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server.
Compress them on a defined schedule.
Pass information about them to the analysis server.
Write another program, which sits on the core server and does the following:
Pulls compressed files when the network/CPU is not too busy. (This can be multi-threaded.)
Uses the information passed to it from the end computers to determine which log to get next.
Uncompresses the files and uploads them to your database continuously.
This should give you a solution that provides up-to-date information with a minimum of downtime.
The downside will be relatively consistent network/computer use, but to be honest that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues that need resolving.

NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload their log files via FTP to a central location, which you can process them from. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program that automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
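The scheduled task on each server could drive the Windows ftp client with a script file (its -s:scriptfile option). A small sketch that generates such a script is shown below; the host, credentials and archive path are placeholders, and the path is assumed to contain no spaces. Note that the password ends up in plain text in the script file, so the file needs restrictive permissions.

```python
def make_ftp_script(host, user, password, archive_path):
    """Return the text of a command script for the Windows ftp client."""
    lines = [
        f"open {host}",
        user,       # answers the user name prompt that follows "open"
        password,   # answers the password prompt
        "binary",   # zip archives must be sent in binary mode
        f"put {archive_path}",
        "bye",
    ]
    return "\n".join(lines) + "\n"


script = make_ftp_script(
    "ftp.example.com", "loguser", "secret", r"C:\Logs\web01_logs.zip"
)
```

The scheduled task would then write this text to a file and run the ftp client against it.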
