Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset? - performance

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving the CaseInsentive off and just converting case.

IMHO, The best is to assure the data quality at Post time. Reasonings:
You (usually) know the nature of the data. So, eg. you can use UpperCase (knowing that GUIDs are all in ASCII range) instead of much slower AnsiUpperCase which a general component like TdxMemDataSet is forced to use.
You enter the data only once. Searching/Sorting/Filtering which all implies the internal upercassing engine of TdxMemDataSet it's a repeated action. Also, there are other chained actions which will trigger this engine whithout realizing. (Eg. a TcxGrid which is Sorted by default having GridMode:=True (I assume that you use the DevEx. components) and having a class acting like a broker passing the sort message to the underlying dataset.
Usually the data entry is done in steps, one or few records in a batch. The only notable exception is data aquisition applications. But in both cases above the user's usability culture allows way greater response times for you to play with. (IOW how much would add an UpperCase call to a record post which lasts 0.005 ms?) OTOH, users are very demanding with the speed of data retreival operations (searching, sorting, filtering etc.). Keep the data retreival as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when you'll write (if you'll write) other modules (you need to remember to AnsiUpperCase the data in any module in any language you'll write). Also here a classical example is when you'll use other external tools to access the data (for ex. db managers to execute an SQL SELCT over the data).
hth.

Maybe the DevExpress forums (or ever a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, is better to guarantee that data is on the format you want - for the reasons plainth already explained - the moment you save it. So, in that specific, make sure the GUID is written in upper(or lower, its a matter of taste)case. If it is SQL Server or another database server that have an guid datatype, make sure the SELECT make the work - if applicable and possible, even the sort.

Related

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There's too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDS of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…

Should I protect my database from invalid data?

I always tend to "protect" my persistance layer from violations via the service layer. However, I am beginning to wonder if it's really necessary. What's the point in taking the time in making my database robust, building relationships & data integrity when it never actually comes into play.
For example, consider a User table with a unique contraint on the Email field. I would naturally want to write blocker code in my service layer to ensure the email being added isn't already in the database before attempting to add anything. In the past I have never really seen anything wrong with it, however, as I have been exposed to more & more best practises/design principles I feel that this approach isn't very DRY.
So, is it correct to always ensure data going to the persistance layer is indeed "valid" or is it more natural to let the invalid data get to the database and handle the error?
Please don't do that.
Implementing even "simple" constraints such as keys is decidedly non-trivial in a concurrent environment. For example, it is not enough to query the database in one step and allow the insertion in another only if the first step returned empty result - what if a concurrent transaction inserted the same value you are trying to insert (and committed) in between your steps one and two? You have a race condition that could lead to duplicated data. Probably the simplest solution for this is to have a global lock to serialize transactions, but then scalability goes out of the window...
Similar considerations exist for other combinations of INSERT / UPDATE / DELETE operations on keys, as well as other kinds of constraints such as foreign keys and even CHECKs in some cases.
DBMSes have devised very clever ways over the decades to be both correct and performant in situations like these, yet allow you to easily define constraints in declarative manner, minimizing the chance for mistakes. And all the applications accessing the same database will automatically benefit from these centralized constraints.
If you absolutely must choose which layer of code shouldn't validate the data, the database should be your last choice.
So, is it correct to always ensure data going to the persistance layer is indeed "valid" (service layer) or is it more natural to let the invalid data get to the database and handle the error?
Never assume correct data and always validate at the database level, as much as you can.
Whether to also validate in upper layers of code depends on a situation, but in the case of key violations, I'd let the database do the heavy lifting.
Even though there isn't a conclusive answer, I think it's a great question.
First, I am a big proponent of including at least basic validation in the database and letting the database do what it is good at. At minimum, this means foreign keys, NOT NULL where appropriate, strongly typed fields wherever possible (e.g. don't put a text field where an integer belongs), unique constraints, etc. Letting the database handle concurrency is also paramount (as #Branko Dimitrijevic pointed out) and transaction atomicity should be owned by the database.
If this is moderately redundant, than so be it. Better too much validation than too little.
However, I am of the opinion that the business tier should be aware of the validation it is performing even if the logic lives in the database.
It may be easier to distinguish between exceptions and validation errors. In most languages, a failed data operation will probably manifest as some sort of exception. Most people (me included) are of the opinion that it is bad to use exceptions for regular program flow, and I would argue that email validation failure (for example) is not an "exceptional" case.
Taking it to a more ridiculous level, imagine hitting the database just to determine if a user had filled out all required fields on a form.
In other words, I'd rather call a method IsEmailValid() and receive a boolean than try to have to determine if the database error which was thrown meant that the email was already in use by someone else.
This approach may also perform better, and avoid annoyances like skipped IDs because an INSERT failed (speaking from a SQL Server perspective).
The logic for validating the email might very well live in a reusable stored procedure if it is more complicated than simply a unique constraint.
And ultimately, that simple unique constraint provides final protection in case the business tier makes a mistake.
Some validation simply doesn't need to make a database call to succeed, even though the database could easily handle it.
Some validation is more complicated than can be expressed using database constructs/functions alone.
Business rules across applications may differ even against the same (completely valid) data.
Some validation is so critical or expensive that it should happen prior to data access.
Some simple constraints like field type/length can be automated (anything run through an ORM probably has some level of automation available).
Two reasons to do it. The db may be accessed from another application..
You might make a wee error in your code, and put data in the db, which because your service layer operates on the assumption that this could never happen, makes it fall over if you are lucky, silent data corruption being worst case.
I've always looked at rules in the DB as backstop for that exceptionally rare occasion when I make a mistake in the code. :)
The thing to remember, is if you need to , you can always relax a constraint, tightening them after your users have spent a lot of effort entering data will be far more problematic.
Be real wary of that word never, in IT, it means much sooner than you wished.

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
Your wrote at the end of your question the following
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think it's your main problem. If you don't solve the main problem you will be able to reduce the size of transferred data to 10 times for example, but you still don't solve the main problem. Let us we think about the question: Why so many data should be sent to the client (to the web browser)?
The data on the client side are needed to display some information to the user. The monitor is not so large to show 30,000 total on one page. No user are able to grasp so much information. So I am sure that you display only small part of the information. In the case you should send only the small part of information which you display.
You don't describe how the guids will be used on the client side. If you need the information during row editing for example. You can transfer the data only when the user start editing. In the case you need transfer the data only for one association.
If you need display the guids directly, then you can't display all the information at once. So you can send the information for one page only. If the user start to scroll or start "next page" button you can send the next portion of data. In the way you can really dramatically reduce the size of transferred data.
If you do have no possibility to redesign the part of application you can implement your original suggestion: by replacing of GUID "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" to the number like 123 you reduce the size of GUID from 34 characters to 3. If you will send additionally array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of data 30000*34 = 1020000 (1 MB) to 300*39 + 30000*3 = 11700+90000 = 101700 (100 KB). So you can reduce the size of data in 10 times. The usage of compression of dynamic data on the web server can reduce the size of data additionally.
In any way you should examine why your page is so slowly. If the program works in LAN, then the transferring of even 1MB of data can be quick enough. Probably the page is slowly during placing of the data on the web page. I mean the following. If you modify some element on the page the position of all existing elements have to be recalculated. If you would be work with disconnected DOM objects first and then place the whole portion of data on the page you can improve the performance dramatically. You don't posted in the question which technology you use in you web application so I don't include any examples. If you use jQuery for example I could give some example which clear more what I mean.
The lookup index you propose is nothing else than a "custom" compression scheme. As amdmax stated, this will increase your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg states correctly, that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as good as possible (depending on the algorithm) until decompression.
To get sure, just send/generate the data uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will be better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
What you will get instead of Guids which are 16 bytes long is int which is 4 bytes long. And you will get a dictionary full of key value pairs that will associate each guid to some int value, right?
It will decrease your transfer time when there're many objects with the same id used. But will spend CPU time before transfer to compress and after transfer to decompress. So what is the amount of data you transfer? Is it mb / gb / tb? And is there any good reason to compress it before sending?
I do not know how dynamic is your data, but I would
on a first call send two directories/dictionaries mapping short ids to long GUIDS, one for your associations and on for your employees e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Associations and Employees instances; I suspect you do not simply display GUIDs
on subsequent calls just send the index of Employees per Association { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the ids in the array value, the short ids of the employees. Given your description building the reverse index: Employee to Associations may give better result size wise (but higher processing)
Then its all down to associative arrays manipulations which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list ever time. Is that truly a requirement? Can you send the entire list once, and then just updates?

Performance: Need to read from LONGTEXT

I'm building a CMS-type webapp that allows users to enter arbitrary-sized blocks of HTML. These blocks are entered by the user in their admin area and inserted into their template of choice when a page is delivered.
I'm guessing a user is not going to add more than 50-100 blocks and I'm not going to be getting more than 1000 users any time soon.
I was planning on using mySQL's LONGTEXT type to store these but I'm wondering if storing files in a directory will be more performant as the Linux OS will cache them? Given that I'm building for at most (1000 * 100) text blocks is there any reasonable performance worry with using mySQL?
Obviously I will be caching the HTML before delivery so I won't be reading these blocks on every delivery - reads will only occur when someone updates/creates new content.
I could use memcached/other cache/noSQL implementation or some other storage mechanism but I'm focusing on keeping it simple and delivering ASAP so don't want to introduce other stuff that I don't have experience with unless there's a significant performance worry.
Are the blocks of HTML content the only thing you are saving? If so, a file may be easiest.
However, it seems likely that you may want to save other bits of information along with the HTML and be able to query based on those bits of data. For example: date created, date last modified, name of the block, the user(s) who have edited the block.
If this is the case, then a database may be the best way to go. Since you said you do not expect to have many users (at least not a first) I would concentrate on finding the solution that is the fastest / most flexible to program and focus on performance and caching after your website begins to grow in size.
I advise you to use a flat file rather than Mysql to store this kind of data.
Html is more a "file" than a "value information" so it hasn't to be in a DB.
Moreover, you will certainly have better performances.
You can also read this post.

efficient serverside autocomplete

First off all I know:
Premature optimization is the root of all evil
But I think wrong autocomplete can really blow up your site.
I would to know if there are any libraries out there which can do autocomplete efficiently(serverside) which preferable can fit into RAM(for best performance). So no browserside javascript autocomplete(yui/jquery/dojo). I think there are enough topic about this on stackoverflow. But I could not find a good thread about this on stackoverflow (maybe did not look good enough).
For example autocomplete names:
names:[alfred, miathe, .., ..]
What I can think off:
simple SQL like for example: SELECT name FROM users WHERE name LIKE al%.
I think this implementation will blow up with a lot of simultaneously users or large data set, but maybe I am wrong so numbers(which could be handled) would be cool.
Using something like solr terms like for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this so users with big sites please tell me.
Maybe something like in memory redis trie which I also haven't tested performance on.
I also read in this thread about how to implement this in java (lucene and some library created by shilad)
What I would like to hear is implementation used by sites and numbers of how well it can handle load preferable with:
Link to implementation or code.
numbers to which you know it can scale.
It would be nice if it could be accesed by http or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes setup. (Yes, in many cases LIKE does use the indexes. There is an excellent article on the topic at myitforum. SQL Performance - Indexes and the LIKE clause ).
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually works as ever decreasing sub sets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire resultset for query 1 then there is no need to hit the database again for the following result sets as they are a sub set of your original query.
Depending on your data, this may open a further opportunity for optimisation.
I will no comply with your requirements and obviously the numbers of scale will depend on hardware, size of the DB, architecture of the app, and several other items. You must test it yourself.
But I will tell you the method I've used with success:
Use a simple SQL like for example: SELECT name FROM users WHERE name LIKE al%. but use TOP 100 to limit the number of results.
Cache the results and maintain a list of terms that are cached
When a new request comes in, first check in the list if you have the term (or part of the term cached).
Keep in mind that your cached results are limited, some you may need to do a SQL query if the term remains valid at the end of the result (I mean valid if the latest result match with the term.
Hope it helps.
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
What i would want to know is "what you are trying to auto complete".
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale the system by replicating data. Trying to cache calls or predict results just make things complicated, and don't get to the root of the problem (ie you can only take them so far, like if each request missed the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.

Resources