Image Server Performance - image

I have read many questions/comments regarding saving the image in DB or file system on server side. However i'm still confused. For now I allow user to upload image (limit to 10MB) and I save the image in the server folder and serve the image via apache context path configuration pointed to that location. However, due to the numbers of image and high load. We want to provide load balancing and fail over functionality. So I have 2 options.
Add code to replicate the uploaded image to all servers or using rsync to do that.
Using CouchDB or MongoDB and save the image as attachment of an document. So I have out of the box replicate functionality.
Can anyone show me the pros/cons of these approach. Can CouchDB/MongoDB have the same read performance compared to file system ?

You can also store files in distributed file system. The benefit over DB supported image server is you do not have to alter the application. Obviously, storing all the data the same way, including images, may be a benefit for you, but changing architecture for already working system may also be problematic.
For example, GlusterFS may be installed on top of "normal" file system to give you distributed features minimizing changes to the system itself. It is supposed to support via its plugins (translators) all the feature you would potentially expect from cloud system: replication, load balancing, stripping of files into relocated parts and fail-over.

Can CouchDB/MongoDB have the same read performance compared to file system ?
No, there will be lag between file system timers and database timers, this is an unfortunately reality.
I have no idea of your current setup, load and performance so I cannot really advise on what to do, however, Apache isn't really a good image server anyway.
Your best bet might be to look into a CDN cache for your images.

Related

Is it advisable to use Redis or Memcached as a cache for FILES?

I have multiple configuration files which I need to read from disk and apply to many records.
I need to improve this to increase performance.
I have two processes.
Process1: Update Configuration:
This updates content configuration files.
This can run from multiple locations.
Process2: Apply Configuration:
This uses content of configuration files.
This can run from multiple locations.
At present, this is using direct file+n/w IO to read updated configuration files.
Both processes are back-end and there is no browser involved here.
Should I use Redis or Memcached as a cache for FILES ?
Note that file need to be read from a common location. They are being updated by another background process. Update can happen any time. Size of configuration files is 1K to 10K.
I want Process2 to access updated configuration files in fastest way possible.
Redis is good choice as it preserves data in memory with optional persistence. So such approach does not have to touch hard drive.
The problem I can see here that every client needs to understand Redis and is to use some support library, e.g. in Java or whatever language you use.
Why to not use http itself, e.g. deploy some http file server. You can also provide version checking + caching, so client can store version of file on the server and use client-cache content if the server has same file and download it when it was changed. This is called HEAD, look at http://www.tutorialspoint.com/http/http_methods.htm
You just should use same approach as web itself has. Every browser downloads content, html, css, images etc. Best improvement, for you, is client side caching, e.g. css or images are stored in browsers cache and download only first type or when it was changed.
And if you dont want, you cant use exactly REST approach itself.

Store user profile pictures on disk or in the database?

I'm building an asp.net MVC application where users can attach a picture to their profile, but also in other areas of the system like a messaging gadget on the dashboard that displays recent messages etc.
When the user uploads these I am wondering whether it would be better to store them in the database or on disk.
Database advantages
Easy to backup the entire database and keep profile content/images with associated profile/user tables
when I build web services later down the track, they can just pull all the profile related data from one spot(the database)
Filesystem advantages
loading files from disk is probably faster
any other advantages?
Where do other sites store this sort of information? Am I right to be a little concerned about database performance for something like this?
Maybe there would be a way to cache images pulled out from the database for a period of time?
Alternatively, what about the idea of storing these images in the database, but shadow copying them to disk so the web server can load them from there? This would seem to give both the backup and convenience of a Db, whilst giving the speed advantages of files on disk.
Infrastructure in question
The website will be deployed to IIS on windows server 2003 running NTFS file system.
The database will be SQL Server 2008
Summary
Reading around on a lot of related threads here on SO, many people are now trending towards the SQL Server Filestream type. From what I could gather however (I may be wrong), there isn't much benefit when the files are quite small. Filestreaming however looks to greatly improve performance when files are multiple MB's or larger.
As my profile pictures tend to sit around ~5kb I decided to just leave them stored in a filestore in the database as varbinary(max).
In ASP.NET MVC I did see a bit of a performance issue returning FileContentResults for images pulled out of the database like this. So I ended up caching the file on disk when it is read if the location to this file is not found in my application cache.
So I guess I went for a hybrid;
Database storage to make baking up of data easier and files are linked directly to profiles
Shadow copying to disk to allow better caching
At any point I can delete the cache folder on disk, and as the images are re-requested they will be re-copied on first hit and served from the cache there after.
You should store a reference to the files on a database and store the actual files on disk.
This approach is more flexible and easier to scale.
You can have a single database and several servers serving static content. It will be much trickier to have several databases doing that work.
Flickr works this way.
I gave a more detailed answer here, you may find it useful.
Actually your datastore look up with the database may actually be faster depending on the number of images you have, unless you are using highly optimized filesystem engine. Databases are designed for fast lookups and use a LOT more interesting techniques than a file system does.
reiserfs (obsolete) really awesome for lookups, zfs, xfs and NTFS all have fantastic hashing algorithms, linux ext4 looks promising too.
The hit on the system is not going to be any different in terms of block reads. The question is what is faster a query lookup that returns the filename (may be a hash?) which in turn is accessed using a separate open, filesend close? or just dumping the blob out?
There are several things to consider, including network hit, processing hit, distributability etc. If you store stuff in the database, then you can move it. Then again, if you store images on a content delivery service that may be WAY faster since you are not doing any network hits on yrouself.
Think about it, and remember bit of benchmarking never hurt nobody :-) so test it out with your typical dataset size and take into account things like simultaneous queries etc.

What are the best practices for image serving?

What techniques do people commonly use for uploading, storing and presenting images with a CMS?
Do you store them in the database or on the file system?
Do you generate thumbnails on upload? Or on the fly, then maybe cache them for reuse? Or rely on browser scaling?
Typically, most content management systems will store images the actual data of image uploads to the file systems and then add a link to the file within the database. Thumbnails can either be generated on upload or on first request (on the fly is considered inefficient, especially given the cheap cost of storage). Browser scaling is a bad idea (images may be uploaded as multi megabyte uncompressed files) but is done by some systems.
i agree with kevin. i can't think of any cms that doesn't store in the file system. then only issue that comes up with that technique is if you are planning on clustering multiple web servers to run your cms. if thats the case then you have to plan on it and have the ability to point all the web servers to the same file storage location.
the technique ive used for years is on upload, resize the image to something practical for the web, then generate the thumbnail, then write them to the file system and record the pointer in the database.
if the site is a huge site then you need serve the images from cache servers because file systems are very slow in comparison to network IO. take facebook for example, they have billions of images on their site and last i heard 80% were held in cache servers around the world in ram. the file storage array they have is more or less a backup to the cache servers.

What is the best way to manage user photos for a website?

My question is about displaying thumbnails and storage.
Let's say I have a website where users can upload photos and view them in albums.
How are the photos usually stored in this scenario? Are the images themselves or are the file paths usually stored in the database?
If the photos are large and you want to display thumbnails, is it better to:
save a copy of the image and a reduced size image, only displaying the larger if requested?
use HTML to reduce the size?
It's almost always a bad idea to store images in a database. BLOBs can really slow down a database something fierce. It also limits your ability to spread storage around different drives. When the files are separate, you can even have one or more separate image servers to reduce the load on the main dynamic server. My recommendations are:
In your database table, have columns for both the directory the image resides in and the image name. That way you are free to change where images are stored, round-robin drives, add more storage later and put new images in the new storage, or whatever you want. Storing the path and the filename in separate fields makes it trivial to move images from one directory to another.
You definitely want to generate thumbnail images to reduce your network bandwidth and make your application run faster. However, you can generate the thumbnails on demand, or when the system load is low. If you're on Linux, ImageMagick is wonderful at automated batch resizing of images. It can even resize by a percentage instead of an absolute amount.
Some software such as TikiWiki stores the photos in a database. It then also caches thumbnail sized photos in the database.
Other software stores it in a directory. This is the way Gallery2 operates. I find the directory approach more scaleable. If a different size than the original is requested, typically the app will use ImageMagick to resize the photo, and then store a copy of the resized photo.
Another alternative is to re-upload the photo to a service like S3, and not store the photo locally at all.
This is common question and the basic answer is that it depends. You need to give more information. What database are you planning on using? SQL Server 2008 has some good new features for handling this scenario with FILESTREAM function. Generally I prefer to put them in the database, but if you just stuff them in their without thinking about design and access requirements you could have poor performance as the number of photos increases.
IF you are absolutely positively sure that your web server will always have access to the file system hosting the images, then go that route. Maybe.
However, if at any time you think you might need to, i don't know, create an image server because the hard drive on your web server is running out of space OR that you need to run multiple web servers, then save yourself the trouble and store them in a database. The hard part in storing on a file system is the security requirements of crossing the network.
Also, bear in mind that not all database servers are created equal in this regard. SQL 2008 introduced a FILESTREAM data type which actually stores the images on the local file system while allowing all read / write access through the db server. This has the added benefit of allowing you to run virus scanners on the incoming files while in storage.
Oracle has had some nice file storage facilities for awhile now. MySQL? I don't think I'd want to try, but you might be okay.
As to the second question: save a thumbnail along with the image. This process occurs only once per image and saves on presentation bandwidth. Using HTML to size an image down really does nothing for the client.

Images in load balanced environment

I have a load balanced enviorment with over 10 web servers running IIS. All websites are accessing a single file storage that hosts all the pictures. We currently have 200GB of pictures - we store them in directories of 1000 images per directory. Right now all the images are in a single storage device (RAID 10) connected to a single server that serves as the file server. All web servers are connected to the file server on the same LAN.
I am looking to improve the architecture so that we would have no single point of failure.
I am considering two alternatives:
Replicate the file storage to all of the webservers so that they all access the data locally
replicate the file storage to another storage so if something happens to the current storage we would be able to switch to it.
Obviously the main operations done on the file storage are read, but there are also a lot of write operations. What do you think is the preferred method? Any other idea?
I am currently ruling out use of CDN as it will require an architecture change on the application which we cannot make right now.
Certain things i would normally consider before going for arch change is
what are the issues of current arch
what am i doing wrong with the current arch.(if this had been working for a while, minor tweaks will normally solve a lot of issues)
will it allow me to grow easily (here there will always be a upper limit). Based on the past growth of data, you can effectively plan it.
reliability
easy to maintain / monitor / troubleshoot
cost
200GB is not a lot of data, and you can go in for some home grown solution or use something like a NAS, which will allow you to expand later on. And have a hot swappable replica of it.
Replicating to storage of all the webservers is a very expensive setup, and as you said there are a lot of write operations, it will have a large overhead in replicating to all the servers(which will only increase with the number of servers and growing data). And there is also the issue of stale data being served by one of the other nodes. Apart from that troubleshooting replication issues will be a mess with 10 and growing nodes.
Unless the lookup / read / write of files is very time critical, replicating to all the webservers is not a good idea. Users(of web) will hardly notice the difference of 100ms - 200ms in loadtime.
There are some enterprise solutions for this sort of thing. But I don't doubt that they are expensive. NAS doesn’t scale well. And you have a single point of failure which is not good.
There are some ways that you can write code to help with this. You could cache the images on the web servers the first time they are requested, this will reduce the load on the image server.
You could get a master slave set up, so that you have one main image server but other servers which copy from this. You could load balance these, and put some logic in your code so that if a slave doesn’t have a copy of an image, you check on the master. You could also assign these in priority order so that if the master is not available the first slave then becomes the master.
Since you have so little data in your storage, it makes sense to buy several big HDs or use the free space on your web servers to keep copies. It will take down the strain on your backend storage system and when it fails, you can still deliver content for your users. Even better, if you need to scale (more downloads), you can simply add a new server and the stress on your backend won't change, much.
If I had to do this, I'd use rsync or unison to copy the image files in the exact same space on the web servers where they are on the storage device (this way, you can swap out the copy with a network file system mount any time).
Run rsync every now and then (for example after any upload or once in the night; you'll know better which sizes fits you best).
A more versatile solution would be to use a P2P protocol like Bittorreent. This way, you could publish all the changes on the storage backend to the web servers and they'd optimize the updates automatcially.

Resources