Oftentimes when you download a file, the site offering the download will list the MD5 hash for the file being downloaded.
But I've never actually had a bad download. In fact, I thought that since FTP runs over TCP, you couldn't get bad downloads.
Is there any data on how often bad downloads occur (i.e. when checking the MD5 hash tells you the download is bad)?
It's not so much a problem with the TCP/IP protocol accidentally swapping a bit (although that DID happen sometimes in the old days, it's not much of a concern now).
MD5 is especially helpful when downloading a file from a mirror site. For example, getting an ISO for a new OS. The original site can give you the MD5, and then you can download the ISO from another company. To make sure that mirror has not tampered with the image at all, you can use the MD5.
In summary, MD5 is there to validate the authenticity of the file, and a mismatch may or may not mean a hardware-level mishap. Usually it's something a bit more intentional and mischievous.
It is not mostly for bad downloads; it is for verifying the authenticity of the download, to ensure that no one has tampered with it.
Let's say that the file is hosted outside of the web server you are connected to. The website lists the checksum, file size, and file name.
Since the user otherwise has no assurance that the file hasn't been replaced with one that looks the same but has some additional features, like malware or a wrong MIME type, you can check the checksum to be sure.
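In code, that check is just hashing the downloaded file and comparing the result with the published value. A minimal C# sketch (the file name and published hash are placeholders):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class Md5Check
    {
        static void Main()
        {
            // Placeholders: the downloaded file and the hash published
            // by the original site.
            string path = "image.iso";
            string published = "d41d8cd98f00b204e9800998ecf8427e";

            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(path))
            {
                string actual = BitConverter.ToString(md5.ComputeHash(stream))
                                            .Replace("-", "").ToLowerInvariant();
                Console.WriteLine(actual == published ? "OK" : "MISMATCH - do not trust this file");
            }
        }
    }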
Whiteboard Overview
[Two whiteboard diagrams, "Internal" and "Global": 1000 x 750 px, ~130 kB JPEGs hosted on ImageShack.]
Additional Information
I should mention that each user (of the client boxes) will be working straight off the /Foo share. Due to the nature of the business, users will never need to see or work on each other's documents concurrently, so conflicts of this nature will never be a problem. Access needs to be as simple as possible for them, which probably means mapping a drive to their respective /Foo/username sub-directory.
Additionally, no one but my applications (in-house and the ones on the server) will be using the FTP directory directly.
Possible Implementations
Unfortunately, it doesn't look like I can use off-the-shelf tools such as WinSCP, because some other logic needs to be intimately tied into the process.
I figure there are two simple ways to accomplish the above on the in-house side.
Method one (slow; a sketch follows the list):
Walk the /Foo directory tree every N minutes.
Diff against the previous tree using a combination of timestamps (which can be faked by file-copying tools, but that isn't relevant in this case) and checksums.
Merge changes with off-site FTP server.
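A rough C# sketch of method one; MD5 as the checksum and an in-memory snapshot of the previous walk are my assumptions:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    class TreeWalker
    {
        // Maps file path -> (last write time, checksum) from the previous walk.
        static Dictionary<string, Tuple<DateTime, string>> previous =
            new Dictionary<string, Tuple<DateTime, string>>();

        static void Walk(string root)
        {
            foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            {
                DateTime stamp = File.GetLastWriteTimeUtc(path);
                Tuple<DateTime, string> old;
                // Only re-hash when the timestamp changed; hashing every file
                // on every pass is what makes this method slow.
                if (!previous.TryGetValue(path, out old) || old.Item1 != stamp)
                {
                    string sum = Hash(path);
                    if (old == null || old.Item2 != sum)
                        QueueForUpload(path); // hypothetical merge step
                    previous[path] = Tuple.Create(stamp, sum);
                }
            }
            // Deletions (paths in 'previous' that no longer exist) are left
            // out here; a real implementation must handle them too.
        }

        static string Hash(string path)
        {
            using (var md5 = MD5.Create())
            using (var s = File.OpenRead(path))
                return BitConverter.ToString(md5.ComputeHash(s));
        }

        static void QueueForUpload(string path) { /* merge with the off-site FTP server */ }
    }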
Method two (a sketch follows below):
Register for directory change notifications (e.g., using ReadDirectoryChangesW from the WinAPI, or FileSystemWatcher if using .NET).
Log changes.
Merge changes with off-site FTP server every N minutes.
I'll probably end up using something like the second method due to performance considerations.
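A minimal sketch of method two with FileSystemWatcher; the share path and the logging step are placeholders:

    using System;
    using System.IO;

    class ChangeLogger
    {
        static void Main()
        {
            var watcher = new FileSystemWatcher(@"\\server\Foo")
            {
                IncludeSubdirectories = true,
                NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
            };

            // Log every change; a timer would then merge the logged changes
            // with the off-site FTP server every N minutes.
            watcher.Changed += (s, e) => Log(e.FullPath);
            watcher.Created += (s, e) => Log(e.FullPath);
            watcher.Renamed += (s, e) => Log(e.FullPath);
            watcher.EnableRaisingEvents = true;

            Console.ReadLine(); // keep the process alive
        }

        static void Log(string path) { /* record for the next merge pass */ }
    }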
Problem
Since this synchronization must take place during business hours, the first problem that arises is during the off-site upload stage.
While I'm transferring a file off-site, I effectively need to prevent the users from writing to the file (e.g., use CreateFile with FILE_SHARE_READ or something) while I'm reading from it. The internet upstream speeds at their office are nowhere near symmetrical to the file sizes they'll be working with, so it's quite possible that they'll come back to the file and attempt to modify it while I'm still reading from it.
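In .NET that maps to the FileShare flag on FileStream; a sketch with a placeholder path (FileShare.Read corresponds to FILE_SHARE_READ, so other processes can keep reading, but any attempt to open the file for writing fails until the handle is released):

    using System.IO;

    // Readers are still allowed while this stream is open, but writers
    // are blocked until the stream is disposed.
    using (var stream = new FileStream(@"\\server\Foo\user\document.bin",
                                       FileMode.Open,
                                       FileAccess.Read,
                                       FileShare.Read))
    {
        // ... read and upload the contents to the FTP server ...
    }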
Possible Solution
The easiest solution to the above problem would be to create a copy of the file(s) in question elsewhere on the file-system and transfer those "snapshots" without disturbance.
The files (some will be binary) that these guys will be working with are relatively small, probably ≤20 MB, so copying (and therefore temporarily locking) them will be almost instant. The chances of them attempting to write to the file in the same instant that I'm copying it should be close to nil.
This solution seems kind of ugly, though, and I'm pretty sure there's a better way to handle this type of problem.
One thing that comes to mind is something like a file system filter that takes care of the replication and synchronization at the IRP level, kind of like what some A/Vs do. This is overkill for my project, however.
Questions
This is the first time that I've had to deal with this type of problem, so perhaps I'm reading too much into it.
I'm interested in clean solutions that don't require going overboard with the complexity of their implementations. Perhaps I've missed something in the WinAPI that handles this problem gracefully?
I haven't decided what I'll be writing this in, but I'm comfortable with: C, C++, C#, D, and Perl.
After the discussions in the comments my proposal would be like so:
Create a partition on your data server, about 5GB for safety.
Create a Windows Service project in C# that would monitor your data drive / location.
When a file has been modified, create a local copy of the file, preserving the same directory structure, and place it on the new partition.
Create another service that would do the following:
Monitor bandwidth usage.
Monitor file creations on the temporary partition.
Transfer several files at a time (use threading) to your FTP server, respecting the bandwidth available at the time and decreasing / increasing the number of worker threads depending on network traffic.
Remove the files from the partition that have been successfully transferred.
So basically you have your drives:
C: Windows Installation
D: Share Storage
X: Temporary Partition
Then you would have the following services:
LocalMirrorService - Watches D: and copies to X: with the dir structure
TransferClientService - Moves files from X: to ftp server, removes from X:
It also uses multiple threads to move several files at once and monitors bandwidth (a sketch of the LocalMirrorService core follows below).
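Something like this, with assumed paths and no error handling:

    using System.IO;

    class LocalMirror
    {
        const string Source = @"D:\Share";
        const string Staging = @"X:\Staging";

        public static void Start()
        {
            var watcher = new FileSystemWatcher(Source) { IncludeSubdirectories = true };
            watcher.Changed += (s, e) => Mirror(e.FullPath);
            watcher.Created += (s, e) => Mirror(e.FullPath);
            watcher.EnableRaisingEvents = true;
        }

        // Copy the changed file to X:, preserving the directory structure,
        // for the TransferClientService to pick up later.
        static void Mirror(string fullPath)
        {
            if (!File.Exists(fullPath)) return; // skip directories/deletes in this sketch
            string relative = fullPath.Substring(Source.Length + 1);
            string target = Path.Combine(Staging, relative);
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            File.Copy(fullPath, target, true);
        }
    }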
I would bet this is the idea you had in mind, but it seems like a reasonable approach as long as you're good with application development and can build a solid system that handles most failure cases.
When a user edits a document in Microsoft Word, for instance, the file will change on the share and may be copied to X: even though the user is still working on it. Within Windows there is a way to see whether the file handle is still held open; if it is, you can simply watch for the moment the user actually closes the document, so that all their edits are complete, and then migrate the file to drive X:.
That being said, if the user is working on the document and their PC crashes for some reason, the document's file handle may not be released until the document is opened again at a later date, which can cause issues.
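There is no single API call that tells you who has a file open; the usual approximation is simply to try opening it exclusively. A hedged sketch:

    using System.IO;

    class LockProbe
    {
        // Returns true if another process still holds the file open in a way
        // that conflicts with an exclusive open (e.g. Word still editing it).
        public static bool IsInUse(string path)
        {
            try
            {
                using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
                    return false;
            }
            catch (IOException)
            {
                return true;
            }
        }
    }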
For anyone in a similar situation (I'm assuming the person who asked the question implemented a solution long ago), I would suggest an implementation of rsync.
rsync.net's Windows Backup Agent does what is described in method 1, and can be run as a service as well (see "Advanced Usage"). Though I'm not entirely sure if it has built-in bandwidth limiting...
Another (probably better) solution that does have bandwidth limiting is Duplicati. It also properly backs up currently-open or locked files. Uses SharpRSync, a managed rsync implementation, for its backend. Open source too, which is always a plus!
Right now I track installs of a particular piece of software through the use of a cookie. (This is all Windows currently.) Since the download comes from my site through a link, I can give custom links to affiliates to figure out who to credit the download with. I also assign a unique ID of sorts to the computer to track it. These get popped into the registry at some point in the future for persistence.
Now, there are more ways to spread the software rather than just through a download link -- I want the ability to just hand someone an .exe and figure out who gave them that. I could have a .ini file or something to hold the tracking code but that means I have to create an installer for each affiliate -- I'm not completely against this idea but I don't like it either.
Any easier ways? I know this is a common use case -- what do people do?
So I've come up with a solution myself -- it's just going to take a second to implement (since all my server stuff lives on *nix right now), but here is how it goes:
the affiliate comes to the website and signs up
we shoot a POST with the affiliate ID over to a Windows web server
the Windows web server generates an Inno Setup script with the affiliate ID (and other data) in its registry-keys section
Windows starts another process that compiles the Inno Setup script (and supporting files) into an .exe
once it is done, it is shot off to an S3 bucket
the user is told where to download his custom .exe
so, no problem is impossible, it just might take some thinking to get it done
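A C# sketch of the Windows end of that flow; the script template, the registry key, and the ISCC.exe path are all assumptions:

    using System.Diagnostics;
    using System.IO;

    class AffiliateBuilder
    {
        public static void Build(string affiliateId)
        {
            // Hypothetical template: drop the affiliate ID into the
            // [Registry] section of a generated Inno Setup script.
            string script =
                "[Setup]\n" +
                "AppName=MyApp\n" +
                "AppVersion=1.0\n" +
                "DefaultDirName={pf}\\MyApp\n" +
                "\n" +
                "[Registry]\n" +
                "Root: HKCU; Subkey: \"Software\\MyApp\"; ValueType: string; " +
                "ValueName: \"AffiliateId\"; ValueData: \"" + affiliateId + "\"\n";

            string issPath = Path.Combine(Path.GetTempPath(), affiliateId + ".iss");
            File.WriteAllText(issPath, script);

            // ISCC.exe is the Inno Setup command-line compiler; its location
            // here is an assumption.
            Process.Start(@"C:\Program Files (x86)\Inno Setup 5\ISCC.exe",
                          "\"" + issPath + "\"").WaitForExit();

            // ...then ship the resulting .exe off to the S3 bucket...
        }
    }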
I've been looking at the DropBox Mac client and I'm currently researching implementing a similar interface for a different service.
How exactly do they interface with Finder like this? I highly doubt the objects represented in the folder are actual documents downloaded on every load; they must be downloaded dynamically as they are needed. So how can you display these items in Finder without having actual file-system objects?
Does anyone know how this is achieved in Mac OS X?
Or any pointers to Apple APIs or other open-source projects that have similar Finder integration?
Dropbox is not powered by either MacFUSE or WebDAV, although those might be perfectly fine solutions for what you're trying to accomplish.
If it were powered by those things, it wouldn't work when you weren't connected, as both of those rely on the server to store the actual information and Dropbox does not. If I quit Dropbox (done via the menu item) and disconnect from the net, I can still use the files. That's because the files are actually stored here on my hard drive.
It also means that the files don't need to be "downloaded on every load," since they are actually stored on my machine here. Instead, only the deltas are sent over the wire, and the Dropbox application (running in the background) patches the files appropriately. Going the other way, the Dropbox application watches for the files in the Dropbox folder, and when they change, it sends the appropriate deltas to the server, which propagates them to any other clients.
This setup has some decided advantages: it works when offline, it is an order of magnitude faster, and it is transparent to other apps, since they just see files on the disk. However, I have no idea how it deals with merge conflicts (which could easily arise with one or more clients offline), which are not an issue if the server is the only copy and every edit changes that central copy.
Where Dropbox really shines is that they have an additional trick that badges the items in the Dropbox folder with their current sync status. But that's not what you're asking about here.
As far as the question at hand, you should definitely look into MacFUSE and WebDAV, which might be perfect solutions to your problem. But the Dropbox way of doing things, with a background application changing actual files on the disk, might be a better tradeoff.
Dropbox is likely using FSEvents to watch for changes to the file system. It's a great API and can even bundle up changes that happened while your app was not running. It's the same API that Spotlight uses. The menubar app likely does the actual observing itself (since restarting it can fix uploads being hung, for instance).
There's no way they're using MacFUSE, as that would require installing the MacFUSE kernel extension to make Dropbox work, and since I definitely didn't install it, I highly doubt they're using it.
Two suggestions:
MacFUSE
WebDAV
The former will allow you to write an app that appears as a filesystem and does all the right things; the latter will allow you move everything server-side and let the user just mount your service as a file share.
Dropbox on the client is written in python.
The client seems to use a sqlite3 database to index files.
I suppose Dropbox splits files into chunks to reduce bandwidth usage.
By the way, if two people have the same file, even if they do not know each other, the server can optimize and avoid transferring the file again, only copying it on the server side.
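If you wanted to replicate that, the usual approach is content-addressed chunks: hash every chunk and have the server skip any chunk it has already stored, no matter who uploaded it. A minimal sketch; the fixed 4 MB chunk size and SHA-1 are my assumptions, not Dropbox's actual parameters:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    class Chunker
    {
        const int ChunkSize = 4 * 1024 * 1024; // assumed fixed-size 4 MB chunks

        // Yields one hash per chunk; the server only asks for the chunks
        // whose hashes it has never seen before.
        public static IEnumerable<string> ChunkHashes(string path)
        {
            using (var sha1 = SHA1.Create())
            using (var stream = File.OpenRead(path))
            {
                var buffer = new byte[ChunkSize];
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    yield return BitConverter.ToString(sha1.ComputeHash(buffer, 0, read));
            }
        }
    }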
To me it feels like a heavily modified revision control system. It has all the features: updates files based on deltas, options to recover or restore old revisions of files. It almost feels like they are using git (GitFS?), or some filesystem they designed.
You could also give File Conveyor a try. It's a Python daemon capable of instantly detecting FS changes (on Linux through inotify, on OS X through FSEvents), processing the files and syncing them to one or more destinations.
Supported protocols: FTP, SFTP, Amazon S3 (CloudFront is also supported), Rackspace Cloud Files. Can easily be extended. Uses django-storages.
"processing files": e.g. optimizing images, transcoding videos — this was originally conceived to be used for sending static assets to a CDN in the context of speeding up websites)
Back in the old days, Help was not trivial but possible: generate some funky .rtf file with special tags, run it through a compiler, and you got a WinHelp file (.hlp) that actually works really well.
Then, Microsoft decided that WinHelp was not hip and cool anymore and switched to CHM, to the point that they actually axed WinHelp in Vista.
Now, CHM may be nice, but everyone who has tried to open a .chm file over the network knows the nice "Navigation to the webpage was canceled" screen caused by security restrictions.
While there are ways to make CHM work off the network, this is hardly a good choice, because when a user presses the Help button he wants help, not to fiddle with some funky settings.
Bottom line: I find CHM absolutely unusable. But with WinHelp no longer an option either, I wonder what the alternatives are, especially when it comes to integration with my application (i.e., for WinHelp and CHM there are functions that let you jump directly to a topic)?
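For reference, the CHM jump-to-topic call is the HtmlHelp API in hhctrl.ocx; a minimal C# sketch, with placeholder file and topic names:

    using System;
    using System.Runtime.InteropServices;

    class HelpLauncher
    {
        // HH_DISPLAY_TOPIC == 0 displays a specific topic inside the .chm.
        [DllImport("hhctrl.ocx", CharSet = CharSet.Unicode)]
        static extern IntPtr HtmlHelp(IntPtr hwndCaller, string pszFile,
                                      uint uCommand, IntPtr dwData);

        const uint HH_DISPLAY_TOPIC = 0;

        public static void ShowTopic(IntPtr owner)
        {
            // "help.chm" and the topic path are placeholders for your own files.
            HtmlHelp(owner, @"help.chm::/html/intro.htm", HH_DISPLAY_TOPIC, IntPtr.Zero);
        }
    }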
PDF has the disadvantage of requiring Adobe Reader (or one of the more lightweight readers that not many people use). I could live with that, seeing as it's kind of a standard nowadays, but can you reliably tell it to jump to a given page/anchor?
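For the page-jump part: Adobe's documented command-line parameters include /A "page=N", though you then depend on the Reader executable and its install path; a hedged sketch:

    using System.Diagnostics;

    // The Reader path below is an assumption; a robust version would look
    // up the install location in the registry first.
    Process.Start(
        @"C:\Program Files (x86)\Adobe\Acrobat Reader DC\Reader\AcroRd32.exe",
        "/A \"page=10\" \"manual.pdf\"");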
HTML files seem to be the best choice; you then just have to deal with different browsers (CSS and such).
Edit: I am looking to create my own help files. As I am a fan of the "No Setup, Just Extract and Run" philosophy, I have hit this problem many times in the past, because many of my users run my programs off the network, which causes exactly this problem.
So I am looking for a more robust and future-proof way to provide help to my users without having to code a different help system for each application I make.
CHM is a really nice format, but that security stuff makes it unusable; a help system is supposed to provide help to the user, not to generate even more problems.
HTML would be the next best choice, ONLY IF you served it from a public web server. If you tried to bundle it with your app, all the files (and images (and stylesheets (and ...))) would make CHM look like a gift from the gods.
That said, when actually bundled in the installation package, (instead of being served over the network), I found the CHM files to work nicely.
OTOH, another pitfall with CHM files: even if you open a CHM file from a local disk, you may bump into the security block if you originally downloaded it from somewhere, because the file can be marked as having come from an external source when it was obtained.
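That mark is an NTFS alternate data stream named Zone.Identifier attached to the file, so an installer (or the application itself) can strip it; a small sketch:

    using System.Runtime.InteropServices;

    class Unblocker
    {
        // Deleting the Zone.Identifier stream "unblocks" a downloaded file.
        // .NET's File.Delete won't address streams, hence the P/Invoke.
        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern bool DeleteFile(string name);

        public static bool Unblock(string path)
        {
            return DeleteFile(path + ":Zone.Identifier");
        }
    }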
I don't like the HTML option, and actually moved from plain HTML to CHM by compressing and indexing the files. We even use them for a handful of non-Windows customers.
It simply solved the constant little breakage: people putting the files on the network (nesting-depth limits, strange locking effects), antivirus software that died in directories with 30,000 HTML files, 20-minute decompression times while installing on older systems, browser security zones and features, miscalculations of needed space in the installer, etc.
And that's not even counting the people who start "correcting" the files, third-party products with faulty "integration" attempts, complaints about slowness (browser start-up), and so on.
We had all waited years for the problems to go away as OSes and hardware improved, but they kept recurring in a bewildering number of varieties, and enough was enough. We found chmlib and decided that we could fall back on a simple external reader based on it if the OS-provided ones ever stopped working.
Meanwhile we also have our own compiler, so we are MS-free and future-proof. That doesn't mean we will never change (solutions with local web servers seem to be the favourite nowadays), but at least we have a choice.
Our software is both installed locally for clients and served from a network share. We opted to generate both a CHM file and a set of HTML files for serving from the network. Users starting the program locally use the CHM file, and users who get the program served from a network share have to use the HTML files.
We use Help and Manual and can thus easily produce both types of output from the same source project. The HTML output also has search capabilities and doesn't require a web server, so though it isn't an optimal solution, it works fine.
So far all the single-file types for Windows seems broken in one way or another:
WinHelp - obsoleted
HtmlHelp (CHM) - obsoleted on Vista, doesn't work from a network share; other than that it works really nicely
Microsoft Help 2 (HXS) - seems to work right up until the point when it doesn't (corrupted indexes and the like); this is what Visual Studio 2005 and above use, as an example
If you don't want to use an installer and you don't want the user to perform any extra steps to allow CHM files over the network, why not fall back to WinHelp? Vista does not include WinHlp32.exe out of the box, but it is freely available as a download for both Vista and Server 2008.
It depends on how important the online documentation is to your product. A good documentation infrastructure can be complex to establish, but once done it pays off. Here is how we do it:
Help source: DITA-compliant XML, stored in SCC (ClearCase).
Help editing: XMetaL.
Help compilation: a customized Open DITA Toolkit, with custom Perl/Java preprocessing.
The help source cross-references application resources (.RC files, etc.) at compile time.
Help deliverables from a single source: PDF, CHM, Eclipse Help, HTML.
A single source repository produces help for 10+ products with thousands of shared topics.
From what you describe, I would look at Eclipse Help. It's not simple to integrate into .NET or MFC applications: you basically have to do the help mapping to resolve the request to a URL, then fire the URL at the Eclipse Help wrapper or a browser.
Is the question how to generate your own help files, or what is the best help file format?
Personally, I find CHM to be excellent. One of the first things I do when setting up a machine is to download the PHP Manual in CHM format (http://www.php.net/download-docs.php) and add a hotkey to it in Crimson Editor. So when I press F1 it loads the CHM and performs a search for the word my cursor is on (great for quick function reference).
If you are doing "just extract and run", you are going to run into security issues. This is especially true if your users are running Vista (or later). Is there a reason you want to avoid packaging your applications inside an installer? Using an installer would alleviate the "external source" problem, and you would be able to use .chm files without any trouble.
We use InstallAware to create our install packages. It's not cheap, but is very good. If cost is your concern, WIX is open source and pretty robust. WIX does have a learning curve, but it's easy to work with.
PDF has the disadvantage of requiring the Adobe Reader
I use Foxit Reader on Windows at home and at work. A lot smaller and very quick to open. Very handy when you are wondering what exactly a80000326.pdf is and why it is clogging up your documents folder.
I think the solution we're going to end up going with for our application is hosting the help files ourselves. This gives us immediate access to the files and the ability to keep them up to date.
What I plan is to have the content in a large set of XML files, each containing the help for a specific item. These XML files contain links to other XML files, and we use XSLT to display the contents as necessary.
Depending on the licensing, we may build a client-specific XSLT file to tailor the look and feel to what they need. We may also need to show help only for particular versions of our product, which can be done by filtering in the XSLT.
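A minimal sketch of that rendering step with .NET's built-in XSLT support; the file names are hypothetical:

    using System.Xml.Xsl;

    class HelpRenderer
    {
        // Renders one help topic (topic.xml) to HTML using a
        // client-specific stylesheet (client.xslt).
        public static void Render(string topicXml, string stylesheet, string outputHtml)
        {
            var transform = new XslCompiledTransform();
            transform.Load(stylesheet);                 // compile the XSLT
            transform.Transform(topicXml, outputHtml);  // write the HTML
        }
    }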
I use a commercial package called AuthorIT that can generate a number of different formats, such as CHM, HTML, PDF, Word, Windows Help, XML, XHTML, and some others I have never heard of (does DITA ring a bell?).
It is a content management system oriented towards the needs of technical documentation writers.
The advantage is that you can use and re-use the same content to build a set of guides, and then generate them in different formats.
So the bottom line, relative to the question of choosing CHM or HTML or whatever, is that with such a tool you are not locked into a given format: you can provide several among which the user can choose, and you can even add more formats as you go along at no extra cost.
If you just have one guide to create it won't be worth your while, but if you have a documentation set to manage then it is the best to my knowledge. Their support is very helpful also.