Building a file upload site that scales

Building a file upload site that scales - performance

I'm attempting to build a file upload site as a side project, and I've never built anything that needed to handle a large amount of files like this. As far as I can tell, there are three major options for storing and retrieving the files (note that there can be multiple files per upload, so, for example, website.com/a23Fc may let you download a single or multiple files, depending on how many the user originally uploaded - similar to imgur.com):
Stick all the files in one huge files directory, and use a (relational) DB to figure out which files belong to which URLs, then return a list of filenames depending on that. Example: user loads website.com/abcde, so it queries the DB for all files related to the abcde uploads, returns their filenames, and the site outputs those.
Use CouchDB because it allows you to actually attach files to individual records in the DB, so each URL/upload could be a DB record with files attached to it. Example, user loads website.com/abcde, CouchDB grabs the document with the ID of abcde, grabs the files attached to that document, and gives them to the user.
Skip out on using a DB completely, and for each upload, create a new directory and stick the files in that. Example: user loads website.com/abcde, site looks for a /files/abcde/ directory, grabs all the files out of there, and gives them to the user, so a database isn't involved at all.
Which of these seems to most scalable? Like I said, I have very little experience in this area so if I'm completely off or if there is an obvious 4th option, I'm more than open to it. Having thousands or millions of files in a single directory (i.e., option 1) doesn't seem very smart, but having thousands or millions of directories in a directory (i.e., option 3) doesn't seem much better.

A company I used to work for faced this exact problem with about a petabyte of image files. Their solution was to use the Andrew File System (see http://en.wikipedia.org/wiki/Andrew_File_System for more) to store the files in a directory structure that matched the URL structure. This scaled very well in practice.
They also recorded the existence of the files in a database for other reasons that were internal to their application.

I recommend whichever solution you can personally complete in the shortest amount of time. If you already have working CouchDB prototypes, go for it! Same for a relational-oriented or filesystem-oriented solution.
Time-to-market is more important than architecture for two reasons:
This is a side project, you should try to get as far along as possible.
If the site becomes popular, since the primary purpose is file upload, you are likely to rebuild the core service at least once, perhaps more, during the life of the site.

If you are going to user ASP.NET here is article that describes how to use Distributed File System for web farm http://weblogs.asp.net/owscott/archive/2006/06/07/DFS-for-Webfarm-Usage---Content-Replication-and-Failover.aspx

Related

When using asp.net MVC core + EF core + ability to encrypt files. What will be the differences between Blob, FileStream & File System, to manage files

I am working on an asp.net core mvc web application, the web application is a document management workflow. where inside each of the workflow steps users can upload documents, as follow:-
users can upload documents with the following restriction; a file can not exceed 5 MB + all the documents inside a workflow can not exceed 50 MB, unless admin approves it. they can upload as many documents as they want.
we will have lot of views which will show the step and all its documents attached to it, and users can chose to download the documents.
we can have unlimited number of workflows. as the more users register with our application the more workflow will be created.
certain files can be marked as confidential, so they should be encrypted when storing them either inside the database or inside the file system.
we are planning to use EF core as the data access layer for our web application + SQL server 2016 or 2017.
now my question is how we should manage our files, where i found these 3 approaches.
Blob.
FileStream
File system.
now the first approach, will allow us to encrypt the files inside the database + will work with EF. but it will have a huge drawback on performance, since opening a file or querying the files from database means they will be loaded inside the hosting server memory. so since we are seeking for an extensible approach, so i think this approach will not work for us since it is less scalable.
Second approach. will have better performance compared to first approach (Blob), but FileStream are not supported with EF + does not allow encryption. so we have to exclude this also.
third approach. of storing the files inside a folder which have the workflow ID + store the link to the file/folder inside the DB. will allow us to encrypt the files + will work with EF. and have a better performance compared to Blob (not sure if this is valid for FileStream). the only drawback, is that we can not achieve Atomic-ity between the files and their related records inside the database. but with adding some code we can handle this by our-self. for example deleting a database record will delete all its documents inside the folder, and we can add some background jobs to make sure all the documents have database records, other wise to delete the documents..
so based on the above i found that the third approach is the best fit for our need? so can anyone advice on this please? are my assumption correct? and is there a fourth appraoch or a hybrid appraoch that can be a better fit for us?

Although modern RDBMS have been optimised for data storage with the perks of integrity and atomicity, databases should be considered the least most alternative (StackOverflow posts like this and this shall corroborate the above) and therefore the third option mentioned or an improvement thereof shall be the vote.
For instance, a potential improvement would be to store the files renamed to a hash of the content and database the hash which shall eliminate all OS restrictions on subdirectories/files, filenames, and paths. Moreover, with a well structured directory layout duplicates could be filtered out.
The User-defined Database Functions shall aid in achieving atomicity which will efface the need of background jobs. An excellent guide on UDFs particularly for the use of accessing filesytem and invoking an executable can be found here.

Reliable Ways to Send Large Files to Clients

we have a need to regularly provide large files to clients on a daily or weekly basis. Currently our process is this:
Internal process creates the file and places it in a specific folder
Our client connects via SFTP and downloads the file
This work well when the files are small. As they get bigger (50-100 GB in size), we keep getting network interruptions and internal disk space related issues.
What I'd like to see is the following:
Our internal process creates the file.
This file is copied to an intermediary service (similar to something like FileDropper).
Our client will download the file from this intermediary service.
I'd like to know if other people had similar issues and what possible solutions are in place. File Dropper works great for non-business related files but obviously I won't be putting client data on there. We also have an Office 365 subscription. I tried to see what I could use with that but I haven't found anything yet that would help solve this.
Any hints, suggestions or feedback is much appreciated!

Consider Amazon S3.
I have used it several times in the past and it is very reliable both for processing a lot of files and for processing large files

should images come from db or content\Images folder

I am developing a eCommerce website in ASP.NET MVC 3 in C#. Using SQL Server 2008R2. My question is if I have 5 images that I want to show in gridView with thumbnails (e.g. something like Amazon website that gives customers couple of pictures to show) would it be advisory if the images are coming from the database or should I reside in the Content\Images folder? There are quite a few sub-categories in sub-category in my db design. What is the most common suit for a professional developer to follow? Thanks. I know there are few options for third party tools like jquery & Telerik Extensions. So I will use them.
Thanks

From my experience and research it is better to put it in a folder/content structure. Yes, there are security things with opening directories to the public but if you instead upload a file via ftp dynamically the problems are solved. I have heard of horror stories about storing files in database and have seen the issues come up but have resolved them. Basically, it is easier to write to database and there are not the security issues of opening up a directory to public but just make sure to regularly check backups that the files are not corrupt or make sure the data is on a fail over cluster where that will never be a problem.
So summary: Database is fine just regularly check backups by restoring them that they are not corrupt or run as a fail over cluster. Otherwise just go with the typical folder/content structure but use ftp to upload the file so there are no open directories to the public.

For me, the best anwser to this question is this: To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
Sumary: Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.

downloading large amount of files

I'm researching solutions for a potential client. They're requesting the ability to download a large amount of MP3's (1000+) from their online catalog.
I've researched/tested building a zip containing all MP3s using ZipArchive but ran into obvious memory leak issues that have ruled that solution out.
I'm now trying to think out of the box.
One idea was to create an FTP queue or a Torrent type download link for them. Is there anything out there that can pull something like this off?
Any help or suggested direction would be greatly appreciated! Thanks!!
Edit: Here is the overall process/goal that we're trying to achieve.
The client creates music for TV/Flim placement. They maintain a online catalog AND a local copy they send to potential buyers. The online catalog and the offline catalog need to mirror each other. Problem being, they have multiple offices that will have to update their local copy with the new files added to the online catalog from many different locations
Example: East Coast User updates catalog with 100 new files. West Coast User needs to update the offline catalog with the new files retrieved from the online catalog.
We had hoped to create custom zip's of the files each user needed to update their catalog based on the user's download history that we'd maintain in MySQL. We were testing ZipArchive but we couldn't seem to build Zips over 175 MEG (give or take). We're in the process of testing ZipStreaming but are having some issues.
I hope this clears up the overall goal and problems we are facing.

GNU wget?
It can download recursive. Just give wget a list of all files on the server, e.G.
http://www.example.org/filelist.html which contains links like file1.mp3, file2.mp3 etc (apache normally generates such an index file automatically wenn a directory without index.html/php in it gets called.
http://linux.die.net/man/1/wget

Frankly speaking, I can't identify the actual problem/question from your post. If you are looking for minimizing network load, then you need to remember that MP3 files are not compressed well because they are already compressed (not as well as possible, but well). If you are looking for a transport, than any file transfer protocol will do (FTP, SFTP, HTTP, WebDAV).
If you need flexibility and features, I'd recommend SFTP: this is a protocol for remote file system access, so besides "get file" operation it has plenty of useful operations including machine-readable directory listing (not always available in FTP and not available in standard HTTP), built-in ZLib compression, built-in possibility to resume file transfers and more bonuses. HTTP also has ZLib compression, but this one is not always available.
Update: your approach doesn't care about what is really available on the client and you are going to prepare ZIP files based on your (possibly incorrect) knowledge of the client already has.
If the client and server are both applications that you develop, then you should use RSync protocol or something similar to update data online (not using any ZIP files) and download the files that are missing on the client. If direct communication between the client and the server is not possible, you can make the client send his state to the server and the server will prepare an individual package after that. As for ZIP functionality - it's needed only when you use batch update (no real-time communication between the client and the server). I don't know what technology you are using but if your only problem is with ZIP component, you can use something else for data packing - either different ZIP component (for .NET and VCL we have ZIP component) or some other packing solution (for example, our SolFS product doesn't have size limits). Unfortunately I am not aware of RSync-like implementation available as a component.

Windows Registry best practices

In what way is the Windows registry meant to be used? I know it's alright to store a small amount of user preferences, but is it considered bad practice to store all your users data there? I would think it would depend on the data set, so how about for small amounts of data, say, less than 2KB, in 100 or so different key/value pairs. Is this bad practice? Would a flat file or SQLite db be a better practice?

I'm going to take a contrarian view.
The registry is a fine place to put configuration data of all types. In general it is faster than most configuration files and more reliable (individual operations on the registry are transacted so if your app crashes during a write the registry isn't corrupted - in general that isn't the case with ini files).
Marcelo MD is totally right: Storing things like operation percentage complete in the registry (or any other non volitile storage) is a horrible idea. On the other hand storing data like the most recently used files is just fine - the registry was built for just that kind of problem.
A number of the other commenters on this post talking about the MRU list have discussed the problem of what happens when the MRU list gets out of sync due to application crashes. I'm wondering why storing the MRU list in a flat file in per-user storage is any better?
I'm also not sure what the "security implications" of storing your data in the registry are. The registry is just as secure as the filesystem - the registry and the filesystem use the same ACL mechanism to protect their data.
If you ARE going to store your user data in a file, you should absolutely put your data in %APPDATA%\CompanyName\ApplicationName at least - that way if two different developers create an application with the same name (how many "Media Manager" applications are there out there?) you won't have collisions.

For me, simple user configuration items and user data is better to be stored in either a simple XML configuration file, a SQLLite db, or a MS SQL Server Compact db. The exact storage medium depends on the specifics of the implementation.
I only use the registry for things that I need to set infrequently and that users don't need to be able to change/see. For example, I have stored encrypted license information in the registry before to avoid accidental user removal of the data.

Using the registry to store data has mainly one problem: It's not very user-friendly. Users have virtually no chance of backing up their settings, copying them to another computer, troubleshooting them (or resetting them) if they get corrupted, or generally just see what their software is doing.
My rule of thumb is to use the registry only to communicate with the OS. Filetype associations, uninstaller entries, processes to run at startup, those things obviously have to be in the registry.
But data that is for use in your application only belongs in a file in your App Data folder. (whiever one of the 3+ App Data folders Microsoft currently wants you to use, anyway)

As each user has directory space in Windows already dedicated to storing application user data, I use it to store the user-level data (preferences, for instance) there.
In C#, I would get it by doing something like this:
Environment.GetFolderPath( Environment.SpecialFolder.ApplicationData);
Typically, I'll store SQLite files there or whatever is appropriate for the application.

If your app is going to be deployed "in the enterprise", keep in mind that administrators can tweak the registry using group policy tools. For example, if firefox used the registry for things like the proxy server, it would make deployment a snap because an admin can use the standard tools in active directory to set it up. If you use anything else, I dont think such things can be done very easily.
So don't dismiss the registry all together. If there is a chance an admin might want to standardize parts of your configuration across a network, put the setting in the registry.

I think Microsoft is encouraging use of isolated storage instead of the Windows registry.
Here's an article that explains how to use it in .Net.
You can find those files in Windows XP under Documents & Settings\\Local Settings\ App Data\Isolated Storage. The data is in .dat files

I would differentiate:
On the one hand there is application specific configuration data that is needed for the app to run, e.g. IP addresses to connect to, which folders to use for what sort of files etc, and non trivial per user settings.
Those I put in a config file, ini format for simple stuff, xml if it gets more complex.
On the other hand there is trivial per user settings (best example: window positions and layout). To avoid cluttering the config files (which some users will want to edit themselves, so few and clearly arranged entries are a must), I like to put those in the registry (with conservative defaults being set in the app if no settings in the registry can be found).
I mainly do it like istmatt sais: I store config files inside the %APPDATA% folder. Usually in %APPDATA%\ApplicationName, I don't like the .NET default of APPDATA%\CompanyName\ApplicationName\Version, that level of detail and complexity is counterproductive for most small to medium sized applications.
I disagree with the example of Marcelo MD of not storing recently used files in the registry. IMO this is exactly the volatile sort of user specific information that can be stored there.
(His example of what not to do is very good, though!)

To me it seems easier to think of what you should NOT put there.
e.g: dynamic data, such as an editor's "last file opened" and per project options. It is really annoying when your app loses sync with the registry (file deletion, system crash, etc) and retrieves information that is not valid anymore, possibly deadlocking the user.
At an earlier job I saw a guy that stored a data transfer completness percentage there, Writing the new values at every 10k or so and having the GUI retrieve this value every second so it could show on the titlebar.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio