CMS - Save pictures in database, What is the proper structure? - image

I currently build a CMS system that need to save a lot of pictures per article. I have a lot of questions :-)
I need to show the pictures in a few sizes, with or without watermark. In addition I need to have the original picture too, for archive and admin purpose. What that I think to do right now is to save the pictures in the database, in two versions: 1. the original picture, 2. web-optimized version.
It is really convenient way to save all the images in a table. But does it really good idea? Let say that the database will contain a hundred of thousand pictures, the original pictures size is probably around 3MB. so the db can be easily 100TB size.... Is this really good strategy?
On the other hand, I save a smaller version to each picture. This version need to be shown in a few sizes, with and without watermark. Currently I think to do think to this in on each request. the request will have parameters width, and according to this I can decide the size and the watermark. (I'll cache this work of course). Again, Is this a good strategy? do it really gonna work, or this is very expensive extra work?
Is it really better to save this on the db? I mean each request to article, will need around 50 another requests to its images, and each request required open/close connection to the database.
Technologies that I going to use: .net, sql-server 2008, NHibernate.

The best approach would be storing those images in filesystem and ids on database. Because of performance and maintenance reasons. Backing up and restoring would be much easier on filesystem and pushing the DBMS for such a work is not the best idea, you will need to transfer them from db to application and then push to the client. I just believe that's not it's job. Put a lighttpd daemon or something for image hosting and leave it do its job.
But if you like the idea, since you are going with sql server 2008, you can use FILESTREAM to store your images in your tables. Eventually, it will create files in a storage location that you choose and store the binary data in filesystem while providing transactional features and data integrity, it is a big bonus. Take a look at that option. As I remember, that performs good and the actual database will be much compact.
About the dynamic resizing, I say avoid that. Storage is cheaper than CPU time, just create variety of thumbnails and watermarked versions upon upload time and store them once in somewhere then use when required. Do not perform same operations again and again. You may do that at first request to the resized version, this way it will be easier to add new versions or purging the cache periodically to remove unused files. You will also be able to backup just the original versions.

Putting the images in the database has a couple of advantages. ACID tanscations and backup consistency come to mind. If you absolutely need that then put the images in the database. As you pointed out, this comes with a price: you'll need a huge database infrastructure like machines, licenses, operation team. Each image retrieval is a huge DB I/O effort.
A lot of things will be much easier with only storing metadata in the DB and putting the image blobs on a filesystem.
Two approaches to come to a decison:
What is the killer feature you absolutely (absolutely like in "if I don't have that, the whole thing will not work at all") need from the image-in-database approach? If there is one, go for it
Do a back-of-the-napkin business case, calculating the total cost of the image-in-database approach (project efforts, infrastructure, machine, license, operation) and compare that with an image-in-filesystem approach. That should give some hints on how to proceed.

Related

Why do software updates exist?

I know this may sound crazy but hear me out...
Say you have a game and you want to update it (add new features / redecorate for seasonal themes / add LTMs etc.) Now, instead of editing your code and then waiting days for your app market provider (Google/Microsoft/Apple etc.) to approve the update and roll out the changes, why not:
Put all of your code into a database
Remove all of your existing code from your code files
Add code which can run code from a database (reads it in and eval()s it)
This way, there'd be no need for software updates unless you wanted to change your database-related code, and you could simply update your database to change what the app does when it's live.
My Question: Why hasn't this been done?
For example:
Fortnite (a real game) often has LTMs (Limited Time Modes) which are available for a few weeks and are then removed. Generally, the software updates are ~ 5GB and take a lot of time unless your broadband is fast. If the code was fetched from a database and then executed, there'd be no need for these updates and the changes could be instantaneous.
EDIT: (In response to the close votes)
I'm looking for facts and statistics to back up reasons rather than just pure opinions. Answers like 'I think this would be good/bad ... ' aren't needed (that's why there's comments); answers like are 'This would be good/bad as this fact shows that ...' are much better and desired.
There are few challenges in your suggested approach.
putting everything in database will make increase the size of db. But that doest affect much.
If all the code is there in db, its possible to decompile your software and get a way to connect to your code?
too much performance overhead of evals. Precompiled code is optimized for their respective runtime
multiple versions? What to do when you want to have multiple versions of your software?
The main reason updates exists is they are easy to maintain, flexible and allows the developer to ship fastest optimized tool.
Put all of your code into a database
Remove all of your existing code from your code files
Add code which can run code from a database (reads it in and eval()s it)
My Question: Why hasn't this been done?
This is exactly how every game works already.
Each time you launch a game, an executable binary game engine (which you describe in step 3) already reads the rest of your code (often some embedded language like LUA) from a "database" (the file system) and "evals" (interprets) it to run the game, as well as assets like level geometry, textures, sounds and music.
You're talking about introducing a layer of abstraction (a real database) between the engine and its data to hide some of your assets, but the database stores its data on the file system, so you really haven't gained anything, you've just changed the way the data is encoded at rest and queried during runtime, and introduce a ton of overhead in both cases.
On the other hand, you're intentionally cheating your way through the app reviewal process this way, and whatever real technical problems you would have are moot, because your app would not be allowed in the app store. The entire point of the app reviewal process is to prevent people from shipping unverified unreviewed code to users, and if your program is obviously designed to circumvent this, your app will be rejected.
Fortnite (a real game) often has LTMs (Limited Time Modes) which are available for a few weeks and are then removed. Generally, the software updates are ~ 5GB and take a lot of time unless your broadband is fast.
Fotenite will have a small binary executable that is the game engine. Updates to this binary will account for a fraction of a percentage of that 5GB. The rest will be some kind of interpreted/embedded language describing the game's levels (also a tiny fraction) and then assets which account for the rest (geometry, textures, sound, music).
If the code was fetched from a database and then executed, there'd be no need for these updates and the changes could be instantaneous
This makes no sense. If you move that entire 5GB from the file system into a database, you still have to transfer around 5GB worth of database updates. 5GB of data in a database still lives as 5GB of data on the file system, it's just that you can't access it directly anymore. You have to transfer around the exact same amount of data, regardless of how you store it.

Store Images to display in SOLR search results

I have built a SOLR Index which has the image thumbnail urls that I want to render an image along with the search results. The problem is that those images can run into millions and I think storing the images in index as binary data would make the index humongous.
I am seeking guidance on how to efficiently store those images after rendering them from the URLs , should I use the plain file system and have them rendered by tomcat , or should I use a JCR repository like Apache Jackrabbit ?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the effective requiriments before finally deciding how to persist the images.
Do you require versioning?
Are you planning to stir eonly the images or additional metadata?
Do you have any requirements in horizontal scaling?
Do you require any image processing or scaling?
Do you need access to the image metatdata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing on the file system and making them available by an image sppoler implementation is the most simple way to persist your images.
But if you identify some of the above mentioned requirements (which are typical for a content repo or a dam system), then would end up reinventing the wheel with the filesystem approach.
The other option is using a kind of content repository. A JCR repo like for example Jackrabbit or it's commercial implementation CRX is one option. Alfresco (supports CMIS) would be the another valid.
Features like versioning, post processing (scaling ...), metadata extraction and management belong are supported by both mentioned repository solutions. But this requires you to learn a new technology which can be time consuming. Both mentioned repository technologies can get complex.
If horizontal scaling is a requirement I would consider a commercially supported repository implementations (CRX or Alfresco Enterprise) because the communty releases are lacking this functionality.
Me personally I would really depend any decision on the above mentioned requirements.
I extensively worked with Jackrabbit, CRX and Alfresco CE and EE and personally I would go for the Alfresco as I experienced it to scale better with larger amounts of data.
I'm not aware of a image pooling solution that fits your needs exactly but it shouldn't be to difficult to implement that, apart from the fact that recurring scaling operations may be very resource intensive.
I would go for the following approach if FS is enough for you:
Separate images and thumbnail into two locations.
The images root folder will remain, the thumbnails folder is
temporary.
Create a temporary thumbnail folder for each indexing run.
All thumbnails for that run are stored under that location, scaling
can be achived with i.e ImageMagick.
The temporary thumbnail folder can then easily be dropped as soon as
the next run has been completed.
If you are planning to store millions of images then avoid putting all files in the same directory. Browsing flat hierarchies with two many entries will be a nightmare.
Better create a tree structure by i.e. inverting the current datetime (year/month/day/hour/minute ... 2013/06/01/08/45).
This makes sure that the number of files inside the last folder get's not too big (Alfresco is using the same pattern for storing binary objects on the FS and it has proofen to work nicely).

Front and back end techniques to increase performance

What are some of the common and notable performance issues/bottlenecks that are typically encountered in a web application in both, the front-end layer, and the back-end layer?
An example of what I mean in a database is not having something you are querying on be an index. That would slow down the query. On the front-end it might be something funky going on with JavaScript that makes your application seem slow.
What are the general rules of thumb that help navigate such issues? And what are some good to-do's?
Thanks,
Alex
On front-end:
-push all of your assets - css files, images, static content - to a CDN. Edgecast is pretty good and reasonably priced.
-don't use load entire javascript frameworks when you only need a few features from it. only load what's needed.
On back-end
-memcache the results from all database calls by using a hash of the sql query as the key name, and the result set as the value
-make sure you are not making your database tables really 'wide' - tons of columns and column types like 'text' and 'blob'
For the front-end, there are well-known guidelines/rules you can follow, and there are some great tools like YSlow that can help you pinpoint the bottlenecks.
For the back-end, as you've noted, efficient use of indexes is a must. Other optimizations usually involve caching, and basic stuff like avoiding doing stuff within loops that can be done once. I'm sure people here will have suggestions, but remember "premature optimization is the root of all evil!" :-)
Millhouse is on to it. I can also add:
Batch expensive operations up. For example: don't make lots of individual calls to a database if you can do it all in one hit.
Avoid server hops where you can.
Process in parallel if you can (not so common for your 'average' web app but quite possible in larger Enterprise scale apps).
Pre-process: crunching data, pre-puiblishing content etc, the more you can do before it's needed the better.
Use a CQRS-based architecture. CQRS stands for Command/Query Responsability Segregation; it basically means that you have different code (services) for reading from the DB and writing to the DB. A good practice for scalability is to have separate DB's for reading and writing (it actually does make sense, if you read more about CQRS), and you can scale out the reading database by having copies run on multiple servers.
CQRS is not only interesting from a scalability point of view, but also from a code maintenance and clarity point of view. It does take some effort to learn about CQRS and understand it, though.
Check out these links:
http://www.slideshare.net/skillsmatter/ddd-exchange-2010-udi-dahan-on-architectural-innovation-cqrs
http://www.slideshare.net/pjvdsande/rethink-your-architecture-with-cqrs
convert dynamic contents to static contents. regenerate those static contents if their dependent objects changed. I saw one article said that more than 80 percent contents are static on Amazon website.

MVC-3 User-Image Management - Best Practices

Developing using MVC-3, Razor, C#
Been searching around and cannot find advice I'm looking for. My site will contain user-uploaded images (possibly a high number). What is the best practice for managing these pictures (placement, breakdown into sub-folders, etc...)? Where do I place them that will prevent them from getting accidentally blown away if I republish my site periodically?
If there are any good articles or blog posts, that would be helpful. Also, any advice/tips anyone wants to add would be great.
Thanks for your time!
Rob
EDIT
Also would like to know what people do to prevent hot linking.
A site that I run and has a high volume of images, has all of the images stored in a date folder structure. i.e. 2010/Dec/31/image.jpg
There are two reasons for this.
The first is the limited amount of DB space (200 MB) came with my hosting plan. Obviously if I had gigabytes of space I would have stored them in the DB.
The second reason is to keep the number of images in the folders to a minimum. Directory listings take longer with the more files that are contained in them so a new directory every 24 hours was my workaround.
Can you perhaps tell us more about what resources you have or how many images you estimate will be uploaded daily?
If you are using SQL Server 2005 or above you may use FileStreams. If the files are under 1MB in size you might even have better performance if you store them as VARBINARY(MAX). The best part about storing in the database is you may easily use transactions.
As for replication and backup you may use standard database replication and backup with the files.
If you have the space in your DB, then I recommend that, as backup/restore becomes much easier. If you have limited space for your DB, then a folder structure would work, though I would not store more than 1000 files in a single folder. So you'll want to come up with a solution that helps keep a folder from not holding more than 1000 images and folders in one place. If you think you'll have less than 1000 images per day, then a variation on what Sir Psycho suggested would probably work well which would be a folder for each year, then a sub folder under the year with month and day to store all the images for that day.
To answer your question about hot linking: your best bet is to check the referrer website (which should be found in the head of the request for the image) and make sure it's coming from your domain. If it's not, you can either not send back any information, or you send back an image that let's the user know they cannot see the image from the 3rd party site.
The header data can be spoofed, but odds are random visitors coming to the 3rd party site will not only not have done this, but probably wouldn't know/care how to.

Performance: Need to read from LONGTEXT

I'm building a CMS-type webapp that allows users to enter arbitrary-sized blocks of HTML. These blocks are entered by the user in their admin area and inserted into their template of choice when a page is delivered.
I'm guessing a user is not going to add more than 50-100 blocks and I'm not going to be getting more than 1000 users any time soon.
I was planning on using mySQL's LONGTEXT type to store these but I'm wondering if storing files in a directory will be more performant as the Linux OS will cache them? Given that I'm building for at most (1000 * 100) text blocks is there any reasonable performance worry with using mySQL?
Obviously I will be caching the HTML before delivery so I won't be reading these blocks on every delivery - reads will only occur when someone updates/creates new content.
I could use memcached/other cache/noSQL implementation or some other storage mechanism but I'm focusing on keeping it simple and delivering ASAP so don't want to introduce other stuff that I don't have experience with unless there's a significant performance worry.
Are the blocks of HTML content the only thing you are saving? If so, a file may be easiest.
However, it seems likely that you may want to save other bits of information along with the HTML and be able to query based on those bits of data. For example: date created, date last modified, name of the block, the user(s) who have edited the block.
If this is the case, then a database may be the best way to go. Since you said you do not expect to have many users (at least not a first) I would concentrate on finding the solution that is the fastest / most flexible to program and focus on performance and caching after your website begins to grow in size.
I advise you to use a flat file rather than Mysql to store this kind of data.
Html is more a "file" than a "value information" so it hasn't to be in a DB.
Moreover, you will certainly have better performances.
You can also read this post.

Resources