How Would I Serve Thousands of Files per Request - spring

I am working on an application where the user has the potential to download thousands of files in one request into a zip file. Obviously, this will not be practical for our server. What would be the best way to go about serving up thousands of files to users?
Right now, what I have been working on is just having the jQuery fileDownload library request 100 files, then, in the success handler, calling fileDownload again for another 100 files offset by 100. The problem with this is that the fileDownload library (or the server) waits about 20 seconds before the fileDownload fail callback is called.
The other problem with this method is that it isn't practical for the client to receive hundreds of pop-up windows asking whether they want to download 100 files.
We also won't be able to send back thousands of files in the response because our server doesn't and won't have that much memory.

This is purely opinion based on my experience, but here are two options I have seen in use:
Option 1:
Batch process the files, compress them, then advise the user of the download location (a streaming sketch follows this answer). The number and size of files should be limited, though, as this can burn through server resources. I don't recommend this if you have a large number of users.
Option 2 (Best):
Batch process the files into a compressed file, then either enable users to FTP into that location to retrieve the files or, if your users have an FTP location of their own, have the file transferred over to it. I can tell you this is definitely the most effective approach and is used by a number of corporations I have been involved with.
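If you do build the compression step yourself (Option 1), streaming the archive keeps server memory flat: write each file into a ZipOutputStream wrapped around the response's output stream instead of assembling the whole zip in memory. Below is a minimal sketch using Spring's StreamingResponseBody; the endpoint path, the posted list of paths, and the way files are resolved are assumptions, not something taken from the question.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;

@RestController
public class BulkDownloadController {

    // Hypothetical endpoint: the client posts the list of file paths it wants zipped.
    @PostMapping("/download/zip")
    public ResponseEntity<StreamingResponseBody> downloadZip(@RequestBody List<String> paths) {
        StreamingResponseBody body = out -> {
            try (ZipOutputStream zip = new ZipOutputStream(out)) {
                for (String p : paths) {
                    Path file = Path.of(p);   // placeholder: resolve/validate against your storage
                    // Assumes entry names are unique; a real version would de-duplicate them.
                    zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                    try (InputStream in = Files.newInputStream(file)) {
                        in.transferTo(zip);   // streams file content; nothing is buffered in full
                    }
                    zip.closeEntry();
                }
            }
        };
        return ResponseEntity.ok()
                .header("Content-Disposition", "attachment; filename=\"files.zip\"")
                .contentType(MediaType.parseMediaType("application/zip"))
                .body(body);
    }
}
```

Only one small buffer is in memory at a time, so the number of files in the archive doesn't matter, and the user gets a single download prompt instead of hundreds of pop-ups.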

Related

Transfer file takes too much time

I have an empty API written in Laravel, running behind nginx and Apache. The problem is that the API takes a long time when I send it files, but responds quickly when I send blank data.
Case 1: I call the API with a blank request; the response time is only 228 ms.
Case 2: I call the API with a 5 MB file; the file transfer takes so long that the response time is 15.58 s.
So how can we reduce the transfer start time in Apache or nginx? Is there any server configuration or anything else that I have missed?
When I searched on Google, the advice was to keep all versions up to date and use php-fpm, but when I configured php-fpm and the HTTP/2 protocol on my server, I noticed it took even more time than above. All server software is up to date with the current versions.
This has more to do with the fact that one request has nothing to process, so the response is prompt, whereas the other request requires actual processing, so the response takes as long as the server needs to process the content of your request.
Depending on the size of the file and your server configuration, you might hit a limit which will result in a timeout response.
A solution to the issue you're encountering is to chunk your file upload. There are a few packages available so that you don't have to write that functionality yourself; one example is pionl's Laravel Chunk Upload.
An alternative solution would be to offload the file processing to a Queue.
Update
When I searched on Google about chunking, it said chunking is not the best solution for small files like 5-10 MB; it's best suited to big files like 50-100 MB. So is there any server-side chunking configuration or anything else, or can I use this library to chunk small files?
According to the library documentation, this is a web library. What should I use if my API is called from Android and iOS apps?
True, chunking might not be the best solution for smaller files, but it is worth knowing about. My recommendation would be to use some client-side logic to determine whether sending the file in chunks is required, as sketched below. On the server, use a queue to process the uploaded file in the background, so the request isn't held up by the processing and a response can be sent back to the client (iOS/Android app) in a timely manner.
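To make that client-side decision concrete, here is a hedged sketch in plain Java (so it can run as-is from an Android app using HttpURLConnection): files under a size threshold go up in a single request, larger ones are split into chunks. The /upload and /upload/chunk endpoints, the 5 MB threshold, and the X-Chunk-* headers are assumed conventions, not part of the chunk-upload library mentioned above.

```java
import java.io.File;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedUploader {

    private static final long CHUNK_THRESHOLD = 5L * 1024 * 1024; // assumed cut-off: 5 MB
    private static final int CHUNK_SIZE = 1024 * 1024;            // 1 MB per chunk

    public void upload(File file, String baseUrl) throws Exception {
        if (file.length() <= CHUNK_THRESHOLD) {
            // Small file: one plain request, no chunking overhead.
            send(baseUrl + "/upload", file, 0, (int) file.length(), 0, 1);
        } else {
            int totalChunks = (int) Math.ceil((double) file.length() / CHUNK_SIZE);
            for (int i = 0; i < totalChunks; i++) {
                long offset = (long) i * CHUNK_SIZE;
                int len = (int) Math.min(CHUNK_SIZE, file.length() - offset);
                send(baseUrl + "/upload/chunk", file, offset, len, i, totalChunks);
            }
        }
    }

    // Streams one byte range of the file; the X-Chunk-* headers are a made-up convention
    // that the server-side handler would have to agree on.
    private void send(String url, File file, long offset, int len, int index, int total) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setRequestProperty("X-Chunk-Index", String.valueOf(index));
        conn.setRequestProperty("X-Chunk-Total", String.valueOf(total));
        conn.setFixedLengthStreamingMode(len);
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             OutputStream out = conn.getOutputStream()) {
            raf.seek(offset);
            byte[] buf = new byte[8192];
            int remaining = len;
            while (remaining > 0) {
                int n = raf.read(buf, 0, Math.min(buf.length, remaining));
                if (n == -1) break;
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
        if (conn.getResponseCode() >= 400) {
            throw new IllegalStateException("Upload failed with HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```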

Apache Camel - SFTP latency

I'm using Apache Camel to interact with several SFTP endpoints; for each one, I perform the following pipeline:
retrieve the list of existing files
validate those files against a given set of rules
download remote files, in case of successful validation
Everything works like a charm (for about a hundred different endpoints), and the URI used to retrieve the list of files looks like this: sftp://${HOST}:${PORT}/${DIR}?username=${USER}&download=false&recursive=true&disconnect=true&sendEmptyMessageWhenIdle=true
The problem is that, for one of those SFTP endpoints, the Camel SFTP component alternates between the following behaviours:
it immediately returns 0 remote files
it takes a couple of minutes to list the remote content (which consists of around 250 files, from 2 KB to 2 MB each)
In addition, in the latter case, downloading only 10 KB of data takes around 30 seconds.
Since this is happening on this specific SFTP only, I suppose it doesn't directly depend on Camel, which works fine for all other endpoints.
So, my questions are:
what can affect such a connection, leading to an unreasonable delay (there are no network issues, nor huge data to fetch)?
supposing it depends on the remote SFTP endpoint, why should the aforementioned Camel URI immediately return 0 files, since lots of files exist in the SFTP?
Thanks for any feedback.
Let's assume there is no bug in the Camel SFTP component of your version.
what can affect such a connection, leading to an unreasonable delay
(there are no network issues, nor huge data to fetch)?
Given that your app can immediately return 0 remote files, the chance that the problem lies between your app and the target server is relatively low. On the server side, it could be:
Too many folders to traverse
Slow handling of each call by the server
Some other problem on the server side
For the "too many folders to traverse" case, consider ignoring folders that are not needed and tuning other options (e.g. stepwise), as in the sketch below.
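For example, here is a hedged sketch of a route that prunes directories the pipeline never needs and turns off stepwise directory changes (stepwise and antExclude are standard options of the Camel FTP/SFTP component; the property placeholders and the archive/ pattern are assumptions):

```java
import org.apache.camel.builder.RouteBuilder;

public class SftpListingRoute extends RouteBuilder {
    @Override
    public void configure() {
        // stepwise=false avoids a CD round trip per directory level, which helps on
        // deep or slow trees; antExclude prunes subtrees the pipeline never validates.
        from("sftp://{{sftp.host}}:{{sftp.port}}/{{sftp.dir}}"
                + "?username={{sftp.user}}&password={{sftp.password}}"
                + "&download=false&recursive=true&disconnect=true"
                + "&sendEmptyMessageWhenIdle=true"
                + "&stepwise=false"
                + "&antExclude=**/archive/**")   // placeholder pattern: skip folders you don't need
            .to("log:sftp-listing?showHeaders=true");
    }
}
```

Whether stepwise=false actually helps depends on the remote server, but combined with excluding unneeded folders it usually cuts the number of round trips on deep directory trees considerably.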
supposing it depends on the remote SFTP endpoint, why should the
aforementioned Camel URI immediately return 0 files, since lots of
files exist in the SFTP?
The server side could be using multiple SFTP server nodes, and some nodes may be empty due to a file-system synchronization failure. When the client is redirected to an empty SFTP node by the server-side gateway, that node returns 0 remote files in its response and the client reports it as-is.

What about the server uptime when using CGI-binaries?

I need to convert some of my Perl CGI scripts to binaries.
But when a 100 KB script is converted into a binary, it becomes about 2-3 MB. It's clear why: the packer has to bundle in everything needed to execute the script.
The question is about page load times on the server when the scripts are binaries. Say I have a binary Perl script "script" that answers AJAX requests and weighs about 3 MB; will that affect the AJAX requests? If some users have a slow connection, will they wait for ages until all 3 MB are transferred? Or will the server NOT send the 3 MB to the user, just the answer (a short piece of XML/JSON or whatever)?
Another case is an HTML page generated by this binary Perl script on the server. The user points his browser at the script, which weighs 3 MB, and then gets an HTML page back. Will the user again wait until the whole binary has been transferred (every single byte of those 3 MB), or only as long as it takes to load EXACTLY the HTML page (say, 70 KB), with the rest running on the server side only and never making the user wait for it?
Thanks!
Or will the server NOT send the 3 MB to the user, just the answer (a short piece of XML/JSON or whatever)?
This.
The server executes the program. It sends the output of the program to the client.
There might be an impact on performance by bundling the script up (and it will probably be a negative one) but that has to do with how long it takes the server to run the program and nothing to do with how long it takes to send data back to the client over the network.
Wrapping/Packaging a perl script into a binary can be useful for ease of transport or installation. Some folks even use it as a (trivial) form of obfuscation. But in the end, the act of "Unpacking" the binary into usable components at the beginning of every CGI call will actually slow you down.
If you wish to improve performance in a CGI situation, you should seriously consider techniques that make your script persistent, eliminating startup time. mod_perl is an older solution to this problem. More modern solutions include FCGI or wrapping your script into its own mini web server.
Now if you are delivering the script to a customer and a PHB requires wrapping for obfuscation purposes, then be comforted that the startup performance hit only occurs once if you write your script to be persistent.

Heroku - letting users download files from tmp

Let me start by saying I understand that heroku's dynos are temporary and unreliable. I only need them to persist for at most 5 minutes, and from what I've read that generally won't be an issue.
I am making a tool that gathers files from websites and zips them up for download. My tool does everything and creates the zip; I'm just stuck at the last part: providing the user with a way to download the file. I've tried direct links to the file location and HTTP GET requests, and Heroku didn't like either. I really don't want to have to set up AWS just to host a file that only needs to persist for a couple of minutes. Is there another way to download files stored in /tmp?
As far as I know, you have absolutely no guarantee that a request goes to the same dyno as the previous request.
The best way to do this would probably be to either host the file somewhere else, like S3, or to send it immediately in the same request.
If you're generating the file in a background worker, then it most definitely won't work. Every process runs on a separate dyno.
See How Heroku Works for more information on their backend.
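To make "send it immediately in the same request" concrete, here is a hedged sketch using the Java servlet API; the same pattern applies in whatever stack the dyno actually runs. The handler that built the zip streams it back from /tmp and deletes it right away, so the file never needs to outlive the request. The endpoint name, request parameter, and zip-building helper are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import jakarta.servlet.annotation.WebServlet;
import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

@WebServlet("/export")
public class ExportServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Hypothetical helper: gathers the files and writes the archive under /tmp.
        Path zip = buildZipInTmp(req.getParameter("job"));

        resp.setContentType("application/zip");
        resp.setHeader("Content-Disposition", "attachment; filename=\"export.zip\"");
        resp.setContentLengthLong(Files.size(zip));

        // Stream the file back in the same request, then clean up immediately:
        // the dyno's /tmp only needs to hold the file for the duration of this response.
        Files.copy(zip, resp.getOutputStream());
        Files.deleteIfExists(zip);
    }

    private Path buildZipInTmp(String jobId) {
        // Placeholder for the existing zip-building logic described in the question.
        return Path.of("/tmp", jobId + ".zip");
    }
}
```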

Best approach to collecting log files from remote machines?

I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk of each blade. Each server is running Windows Server 2003, and the files are available on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and their size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to only ship new content. In our own system logs are shipped via a service that replicates the logs as they are written. That required a small service that handled the log files to be written, but reduced latency in capturing logs and cut bandwidth use immensely.
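As a hedged illustration of the "keep track of the last byte sent" idea, here is a small Java shipper that records how far it got in each log in a sidecar .offset file and forwards only the bytes appended since the previous run. The shipNewBytes transport hook is a placeholder for whatever compression and transfer you use.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class IncrementalLogShipper {

    // Reads everything appended to the log since the last run and hands it to the
    // (abstract) shipping step. The byte offset is persisted next to the log file.
    public void shipDelta(Path log) throws IOException {
        Path offsetFile = Path.of(log.toString() + ".offset");
        long lastOffset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim())
                : 0L;

        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            long length = raf.length();
            if (length < lastOffset) {
                lastOffset = 0;              // log was rotated/truncated: start over
            }
            if (length == lastOffset) {
                return;                      // nothing new to ship
            }
            raf.seek(lastOffset);
            // A production version would stream this in chunks instead of one array.
            byte[] delta = new byte[(int) (length - lastOffset)];
            raf.readFully(delta);

            shipNewBytes(log, delta);        // placeholder: compress + transfer the new records only
            Files.writeString(offsetFile, Long.toString(length));
        }
    }

    private void shipNewBytes(Path log, byte[] delta) {
        // Assumed transport hook: FTP, HTTP, message queue, etc.
        System.out.printf("Would ship %d new bytes from %s%n", delta.length, log);
    }
}
```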
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB; the Windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant as the offset in minutes from midnight? (A sketch of this push step follows this list.)
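Here is a hedged sketch of that per-server push, using java.util.zip for compression and Apache Commons Net's FTPClient for the transfer; the log directory, central host, credentials, and the two-minutes-per-octet stagger constant are all placeholders:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class LogPusher {

    public static void main(String[] args) throws Exception {
        // Stagger: sleep (last IP octet * 2) minutes, assuming the scheduled task fires at midnight.
        byte[] ip = InetAddress.getLocalHost().getAddress();
        int lastOctet = ip[ip.length - 1] & 0xFF;
        Thread.sleep(lastOctet * 2L * 60_000);

        // Name the archive after the sending host so the central server knows who sent it.
        String host = InetAddress.getLocalHost().getHostName();
        Path archive = Path.of("C:\\Logs", host + "-" + System.currentTimeMillis() + ".zip");
        zipDirectory(Path.of("C:\\Logs\\current"), archive);     // placeholder log directory

        try (FileInputStream in = new FileInputStream(archive.toFile())) {
            FTPClient ftp = new FTPClient();
            ftp.connect("logs.example.internal");                 // placeholder central server
            ftp.login("loguser", "logpass");                      // placeholder credentials
            ftp.setFileType(FTP.BINARY_FILE_TYPE);
            ftp.enterLocalPassiveMode();
            if (!ftp.storeFile(archive.getFileName().toString(), in)) {
                // "notify you when it cannot push its log": hook your alerting here.
                System.err.println("Upload failed: " + ftp.getReplyString());
            }
            ftp.logout();
            ftp.disconnect();
        }
    }

    private static void zipDirectory(Path dir, Path zipFile) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipFile.toFile()));
             DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    zip.putNextEntry(new ZipEntry(f.getFileName().toString()));
                    Files.copy(f, zip);
                    zip.closeEntry();
                }
            }
        }
    }
}
```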
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore it or reprocess it? the sketch after this list simply ignores duplicates)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
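And a hedged companion sketch for the central side: a scanner over the FTP drop directory that skips archives it has already processed (the "same log file twice" case) and hands new ones to whatever processing pipeline you run. The directory paths and the processed-list file are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class LogIntakeScanner {

    private static final Path INBOX = Path.of("D:\\ftp\\incoming");          // assumed FTP drop directory
    private static final Path PROCESSED_LIST = Path.of("D:\\ftp\\processed.txt");

    public static void main(String[] args) throws IOException {
        Set<String> processed = Files.exists(PROCESSED_LIST)
                ? new HashSet<>(Files.readAllLines(PROCESSED_LIST))
                : new HashSet<>();

        try (DirectoryStream<Path> zips = Files.newDirectoryStream(INBOX, "*.zip")) {
            for (Path zip : zips) {
                String name = zip.getFileName().toString();
                if (processed.contains(name)) {
                    continue;                         // same archive received twice: ignore it
                }
                process(zip);                         // placeholder: uncompress, parse, load, archive
                Files.writeString(PROCESSED_LIST, name + System.lineSeparator(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    private static void process(Path zip) {
        System.out.println("Would process " + zip);
    }
}
```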
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read/write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server and does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompress and upload to your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you presumably don't want to run an FTP server on each machine. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can then process. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
