Performance: Need to read from LONGTEXT - performance

I'm building a CMS-type webapp that allows users to enter arbitrary-sized blocks of HTML. These blocks are entered by the user in their admin area and inserted into their template of choice when a page is delivered.
I'm guessing a user is not going to add more than 50-100 blocks and I'm not going to be getting more than 1000 users any time soon.
I was planning on using mySQL's LONGTEXT type to store these but I'm wondering if storing files in a directory will be more performant as the Linux OS will cache them? Given that I'm building for at most (1000 * 100) text blocks is there any reasonable performance worry with using mySQL?
Obviously I will be caching the HTML before delivery so I won't be reading these blocks on every delivery - reads will only occur when someone updates/creates new content.
I could use memcached/other cache/noSQL implementation or some other storage mechanism but I'm focusing on keeping it simple and delivering ASAP so don't want to introduce other stuff that I don't have experience with unless there's a significant performance worry.

Are the blocks of HTML content the only thing you are saving? If so, a file may be easiest.
However, it seems likely that you may want to save other bits of information along with the HTML and be able to query based on those bits of data. For example: date created, date last modified, name of the block, the user(s) who have edited the block.
If this is the case, then a database may be the best way to go. Since you said you do not expect to have many users (at least not a first) I would concentrate on finding the solution that is the fastest / most flexible to program and focus on performance and caching after your website begins to grow in size.

I advise you to use a flat file rather than Mysql to store this kind of data.
Html is more a "file" than a "value information" so it hasn't to be in a DB.
Moreover, you will certainly have better performances.
You can also read this post.

Related

When is it better to generate a static page or dynamically generate?

The title pretty much sums up my question.
When is it more efficient to generate a static page, that a user can access, as apposed to using dynamically generated pages that query a database? As in what situations would one be better than the other.
To serve up a static page, your web server just needs to read the page off the disk and send it. Virtually no processing will be required. If the page is frequently accessed, it will probably be cached in memory, so even the disk access will not be needed.
Generating pages dynamically obviously has more overhead. There is a cost for every DB access you make, no matter how simple the query is. (On a project I worked on recently, I measured a minimum overhead of 0.7ms for each query, even for SELECT 1;) So if you can just generate a static page and save it to disk, page accesses will be faster. How much faster? It just depends on how much work is being done to generate the page dynamically. We don't know what you are doing, so we can't comment on that.
Now, if you generate a static page and save it to disk, that means you need to re-generate it every time the data which went into generating that page changes. If the data changes more often than the page is actually accessed, you could be doing more work rather than less! But in most cases, that's a very unlikely situation.
More likely, the biggest problem you will experience from generating static pages and saving them to disk is coding (and maintaining) the logic for re-generating the pages whenever necessary. You will need to keep track of exactly what data goes into each page, and in every place in the code where data can be changed, you will need to invoke re-generation of all the relevant pages. If you forget just one, then your users may be looking at stale data some of the time.
If you mix dynamic generation per-request and caching generated pages on disk, then your code will be harder to read and maintain, because of mixing the two styles.
And you can't really cache generated pages on disk in certain situations -- like responding to POST requests which come from a form submission. Or imagine that when your users invoke certain actions, you have to send a request to a 3rd party API, and the data which comes back from that API will be used in the page. What comes back from the API may be different each time, so in this case, you need to generate the page dynamically each time.
Static pages (or better resources) are filled with content, that does not change or at least not often, and does not allow further queries on it: About Page, Contact, ...
In this case it doesn't make any sense to query these pages. On the other side we have Data (e.g. in a Database) and want to query it/give the user the opportunity to query it. In this case you give the User a page with the possibility to specify the query and return a rendered page with the dynamically generated data.
In my opinion it depends on the result you want to present to the user. Either it is only an information or it is the possibility to query a Datasource. The first result is known before you do something, the second (query data) is known after you have the query parameters, which means you don't know the result beforehand (it could be empty or invalid).
It depends on your architecture, but when you consider that GET Requests should be idempotent it should be also easy to cache dynamic Pages with a Proxy, and invalidate the cache, when something new happens to the data which is displayed on the cached path. In this case one could save a lot of time, because the system behaves like the cached pages would be static, but instead coming from the filesystem, they come from your memory, which is really fast.
Cheers
Laidback

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
Your wrote at the end of your question the following
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think it's your main problem. If you don't solve the main problem you will be able to reduce the size of transferred data to 10 times for example, but you still don't solve the main problem. Let us we think about the question: Why so many data should be sent to the client (to the web browser)?
The data on the client side are needed to display some information to the user. The monitor is not so large to show 30,000 total on one page. No user are able to grasp so much information. So I am sure that you display only small part of the information. In the case you should send only the small part of information which you display.
You don't describe how the guids will be used on the client side. If you need the information during row editing for example. You can transfer the data only when the user start editing. In the case you need transfer the data only for one association.
If you need display the guids directly, then you can't display all the information at once. So you can send the information for one page only. If the user start to scroll or start "next page" button you can send the next portion of data. In the way you can really dramatically reduce the size of transferred data.
If you do have no possibility to redesign the part of application you can implement your original suggestion: by replacing of GUID "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" to the number like 123 you reduce the size of GUID from 34 characters to 3. If you will send additionally array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of data 30000*34 = 1020000 (1 MB) to 300*39 + 30000*3 = 11700+90000 = 101700 (100 KB). So you can reduce the size of data in 10 times. The usage of compression of dynamic data on the web server can reduce the size of data additionally.
In any way you should examine why your page is so slowly. If the program works in LAN, then the transferring of even 1MB of data can be quick enough. Probably the page is slowly during placing of the data on the web page. I mean the following. If you modify some element on the page the position of all existing elements have to be recalculated. If you would be work with disconnected DOM objects first and then place the whole portion of data on the page you can improve the performance dramatically. You don't posted in the question which technology you use in you web application so I don't include any examples. If you use jQuery for example I could give some example which clear more what I mean.
The lookup index you propose is nothing else than a "custom" compression scheme. As amdmax stated, this will increase your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg states correctly, that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as good as possible (depending on the algorithm) until decompression.
To get sure, just send/generate the data uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will be better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
What you will get instead of Guids which are 16 bytes long is int which is 4 bytes long. And you will get a dictionary full of key value pairs that will associate each guid to some int value, right?
It will decrease your transfer time when there're many objects with the same id used. But will spend CPU time before transfer to compress and after transfer to decompress. So what is the amount of data you transfer? Is it mb / gb / tb? And is there any good reason to compress it before sending?
I do not know how dynamic is your data, but I would
on a first call send two directories/dictionaries mapping short ids to long GUIDS, one for your associations and on for your employees e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Associations and Employees instances; I suspect you do not simply display GUIDs
on subsequent calls just send the index of Employees per Association { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the ids in the array value, the short ids of the employees. Given your description building the reverse index: Employee to Associations may give better result size wise (but higher processing)
Then its all down to associative arrays manipulations which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list ever time. Is that truly a requirement? Can you send the entire list once, and then just updates?

fastest etag algorithm

We want to make use of http caching on our website - in particular content validation.
Because our CMS constructs pages from smaller fragments of content, the last modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of etags. Because page construction is based on lots of other page fragments we think the only real way to provide an accurate etag is by performing some sort of digest on the content stream itself. This seems a little over cooked as caching is supposed to ease the load off the servers but a content digest is obviously CPU intensive.
I'm looking for the fastest algorithm to create a unique etag that is relevant to the content stream (inode etc just is a kludge and wont work). An MD5 hash is obviously going to get the best unique result but is anybody else making use of other algorithms that are faster in a similar situation?
Sorry forgot the important details... Using Java Servlets - running in websphere 6.1 on windows 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content. So figuring out when content has changed can be tricky to determine.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.

CMS - Save pictures in database, What is the proper structure?

I currently build a CMS system that need to save a lot of pictures per article. I have a lot of questions :-)
I need to show the pictures in a few sizes, with or without watermark. In addition I need to have the original picture too, for archive and admin purpose. What that I think to do right now is to save the pictures in the database, in two versions: 1. the original picture, 2. web-optimized version.
It is really convenient way to save all the images in a table. But does it really good idea? Let say that the database will contain a hundred of thousand pictures, the original pictures size is probably around 3MB. so the db can be easily 100TB size.... Is this really good strategy?
On the other hand, I save a smaller version to each picture. This version need to be shown in a few sizes, with and without watermark. Currently I think to do think to this in on each request. the request will have parameters width, and according to this I can decide the size and the watermark. (I'll cache this work of course). Again, Is this a good strategy? do it really gonna work, or this is very expensive extra work?
Is it really better to save this on the db? I mean each request to article, will need around 50 another requests to its images, and each request required open/close connection to the database.
Technologies that I going to use: .net, sql-server 2008, NHibernate.
The best approach would be storing those images in filesystem and ids on database. Because of performance and maintenance reasons. Backing up and restoring would be much easier on filesystem and pushing the DBMS for such a work is not the best idea, you will need to transfer them from db to application and then push to the client. I just believe that's not it's job. Put a lighttpd daemon or something for image hosting and leave it do its job.
But if you like the idea, since you are going with sql server 2008, you can use FILESTREAM to store your images in your tables. Eventually, it will create files in a storage location that you choose and store the binary data in filesystem while providing transactional features and data integrity, it is a big bonus. Take a look at that option. As I remember, that performs good and the actual database will be much compact.
About the dynamic resizing, I say avoid that. Storage is cheaper than CPU time, just create variety of thumbnails and watermarked versions upon upload time and store them once in somewhere then use when required. Do not perform same operations again and again. You may do that at first request to the resized version, this way it will be easier to add new versions or purging the cache periodically to remove unused files. You will also be able to backup just the original versions.
Putting the images in the database has a couple of advantages. ACID tanscations and backup consistency come to mind. If you absolutely need that then put the images in the database. As you pointed out, this comes with a price: you'll need a huge database infrastructure like machines, licenses, operation team. Each image retrieval is a huge DB I/O effort.
A lot of things will be much easier with only storing metadata in the DB and putting the image blobs on a filesystem.
Two approaches to come to a decison:
What is the killer feature you absolutely (absolutely like in "if I don't have that, the whole thing will not work at all") need from the image-in-database approach? If there is one, go for it
Do a back-of-the-napkin business case, calculating the total cost of the image-in-database approach (project efforts, infrastructure, machine, license, operation) and compare that with an image-in-filesystem approach. That should give some hints on how to proceed.

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving the CaseInsentive off and just converting case.
IMHO, The best is to assure the data quality at Post time. Reasonings:
You (usually) know the nature of the data. So, eg. you can use UpperCase (knowing that GUIDs are all in ASCII range) instead of much slower AnsiUpperCase which a general component like TdxMemDataSet is forced to use.
You enter the data only once. Searching/Sorting/Filtering which all implies the internal upercassing engine of TdxMemDataSet it's a repeated action. Also, there are other chained actions which will trigger this engine whithout realizing. (Eg. a TcxGrid which is Sorted by default having GridMode:=True (I assume that you use the DevEx. components) and having a class acting like a broker passing the sort message to the underlying dataset.
Usually the data entry is done in steps, one or few records in a batch. The only notable exception is data aquisition applications. But in both cases above the user's usability culture allows way greater response times for you to play with. (IOW how much would add an UpperCase call to a record post which lasts 0.005 ms?) OTOH, users are very demanding with the speed of data retreival operations (searching, sorting, filtering etc.). Keep the data retreival as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when you'll write (if you'll write) other modules (you need to remember to AnsiUpperCase the data in any module in any language you'll write). Also here a classical example is when you'll use other external tools to access the data (for ex. db managers to execute an SQL SELCT over the data).
hth.
Maybe the DevExpress forums (or ever a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, is better to guarantee that data is on the format you want - for the reasons plainth already explained - the moment you save it. So, in that specific, make sure the GUID is written in upper(or lower, its a matter of taste)case. If it is SQL Server or another database server that have an guid datatype, make sure the SELECT make the work - if applicable and possible, even the sort.

Resources