TClientDataSet is VERY SLOW with 100K+ rows - performance

I have a problem retrieving data with Delphi's TClientDataSet.
Code with ADO:
ADOQuery1.SQL.Text := 'SELECT * FROM Table1 WITH (NoLock)';
DataSource1.DataSet := ADOQuery1;
DataSource1.DataSet.Open;
DataSource1.DataSet.Last;
The code above returns over 180k rows in 3-6 seconds when using pure ADO.
Same code with TClientDataSet:
ADOQuery1.SQL.Text := 'SELECT * FROM Table1 WITH (NoLock)';
CDS1.SetProvider(ADOQuery1);
DataSource1.DataSet := CDS1;
DataSource1.DataSet.Open;
DataSource1.DataSet.Last;
The code above returns the same number of rows (over 180k), but it takes 3-4 minutes.
What's wrong with the CDS? It's about 100 times slower than pure ADO.
Is it possible to fix it?

The code above returns over 180k rows in 3-6 seconds when using pure ADO.
For some reason, I wouldn't expect the code you posted to return every single one of the 180k records... I'd expect the first "X" records to be loaded after TADOQuery.Open is called, and then the last "X" records to be fetched when TADOQuery.Last is called. Looping with while not Eof do instead of calling .Last would probably be a better performance test since (I assume) you actually want to browse all the records.
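For example, a minimal sketch of such a benchmark loop, assuming the ADOQuery1 component from the question (DisableControls is only there to keep linked controls from repainting):

ADOQuery1.Open;
ADOQuery1.DisableControls;  // avoid repainting linked controls while looping
try
  while not ADOQuery1.Eof do
    ADOQuery1.Next;
finally
  ADOQuery1.EnableControls;
end;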
When you call TClientDataSet.Last while it is linked to a provider, it most likely does the equivalent of a while not Eof do loop on your query, which transfers all 180k records. Also, TClientDataSet Insert/Append operations tend to get slower and slower the more records you have in it. My best guess is that it has to reallocate its memory buffer every now and then. If that's the case, I haven't found a way to tell the TClientDataSet: "Hey! Brace yourself, 180k records incoming!" (akin to TList.SetCapacity).
If you have an older version of Delphi, one thing that could help is the Midas Speed Fix.

This is pretty old but there are a lot of new Delphi programmers out there now. Here is a bit of a scoop.
When using a CDS in Delphi you are in fact creating an in-memory table. Your query probably went to swap hell.
To get the most out of a CDS, use DBX components to grab the data. They are so-called 'fast forward' cursors that don't create a temp table with a cursor in the database. Forward-only doesn't do the fancy stuff that ADO does.
If you need massive datasets with full update notifications and full control, then use ADO. If you need to plow through a ton of data in a hurry with little load on the server, that is where CDS/DBX shines.
The DBX way is harder. It is just a dragster: First and Next are the only things that work. No updates, no commitments, just a fast one-way relationship. Hook up a DBX/provider/CDS combo and you have it all: speed and editability. Use a version number to detect another user doing something to the data while you were editing. Study the provider options to see how to get power with flexibility. This is pretty much as hard-core as it gets in Delphi.
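A minimal sketch of wiring up that combo; the component names are my own assumptions, not from the answer:

SQLQuery1.SQLConnection := SQLConnection1;      // DBX, forward-only cursor
SQLQuery1.SQL.Text := 'SELECT * FROM Table1';
DataSetProvider1.DataSet := SQLQuery1;          // provider bridges DBX to the CDS
ClientDataSet1.SetProvider(DataSetProvider1);   // or set ProviderName at design time
ClientDataSet1.Open;                            // data is now editable in memory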

Setting the CDS1.LogChanges property to False or True makes no difference; it's the same problem.

Try setting the CDS1.LogChanges property to False before loading the data. This needs to be done in code, as it is not a published property.
From the help file: For large datasets, a value of True for LogChanges may severely impact the performance of the application.
You can then turn it back on after the initial load.
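A minimal sketch of that sequence, using the component names from the question (whether LogChanges pays off in the provider scenario is something you would have to measure):

ADOQuery1.SQL.Text := 'SELECT * FROM Table1 WITH (NoLock)';
CDS1.SetProvider(ADOQuery1);
CDS1.LogChanges := False;   // public (not published) property, so set it in code
CDS1.Open;                  // initial load without change logging
CDS1.Last;
CDS1.LogChanges := True;    // re-enable change logging once the data is in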

Related

How to do Lazy Map deserialization in Haskell

Similar to this question by @Gabriel Gonzalez: How to do fast data deserialization in Haskell
I have a big Map full of Integers and Text that I serialized using cereal. The file is about 10M.
Every time I run my program I deserialize the whole thing just so I can look up a handful of the items. Deserialization takes about 500ms, which isn't a big deal, but I always seem to like profiling on a Friday.
It seems wasteful to always deserialize 100k to 1M items when I only ever need a few of them.
I tried decodeLazy and also changing the map to a Data.Map.Lazy (not really understanding how a Map can be lazy, but OK, it's there), and this has no effect on the time, except that maybe it's a little slower.
I'm wondering if there's something that can be a bit smarter, only loading and decoding what's necessary. Of course a database like SQLite can be very large, but it only loads what it needs to complete a query. I'd like to find something like that, but without having to create a database schema.
Update
You know what would be great? Some fusion of Mongo with SQLite. Like a JSON document database using flat-file storage ... and of course someone has done it https://github.com/hamiltop/MongoLiteDB ... in Ruby :(
Thought mmap might help. Tried the mmap library and segfaulted GHCi for the first time ever. No idea how I can even report that bug.
Tried the bytestring-mmap library and that works, but there's no performance improvement. Just replacing this:
ser <- BL.readFile cacheFile
With this:
ser <- unsafeMMapFile cacheFile
Update 2
keyvaluehash may be just the ticket. Performance seems really good. But the API is strange and documentation is missing, so it will take some experimenting.
Update 3: I'm an idiot
Clearly what I want here is not lazier deserialization of a Map. I want a key-value database, and there are several options available like dvm, tokyo-cabinet and this levelDB thing I've never seen before.
Keyvaluehash looks to be a native-Haskell key-value database, which I like, but I still don't know about its quality. For example, you can't ask the database for a list of all keys or all values (the only real operations are readKey, writeKey and deleteKey), so if you need that you have to store it somewhere else. Another drawback is that you have to tell it a size when you create the database. I used a size of 20M so I'd have plenty of room, but the actual database it created occupies 266M. No idea why, since there isn't a line of documentation.
One way I've done this in the past is to just make a directory where each file is named by a serialized key. One can use unsafeInterleaveIO to "thunk" the deserialized contents of each read file, so that values are only forced on read...
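A minimal sketch of that idea, assuming Text keys, cereal-serializable values and a hypothetical cacheDir (none of these names come from the answer):

import qualified Data.ByteString as BS
import qualified Data.Map as M
import qualified Data.Text as T
import Data.Serialize (Serialize, decode)
import System.Directory (doesFileExist, getDirectoryContents)
import System.FilePath ((</>))
import System.IO.Unsafe (unsafeInterleaveIO)

-- Look up a single key: only that one file is read and decoded.
lookupKey :: Serialize v => FilePath -> T.Text -> IO (Maybe v)
lookupKey cacheDir key = do
  let path = cacheDir </> T.unpack key
  exists <- doesFileExist path
  if not exists
    then return Nothing
    else do
      bytes <- BS.readFile path
      return (either (const Nothing) Just (decode bytes))

-- Build a Map whose values are lazy thunks; nothing is decoded
-- until a value is actually demanded.
lazyLoad :: Serialize v => FilePath -> IO (M.Map T.Text v)
lazyLoad cacheDir = do
  names <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents cacheDir
  pairs <- mapM thunked names
  return (M.fromList pairs)
  where
    thunked name = do
      val <- unsafeInterleaveIO $ do
        bytes <- BS.readFile (cacheDir </> name)
        either fail return (decode bytes)
      return (T.pack name, val)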

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
The long question, for those who want the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
and some other fields. There are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could write an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, and resigned myself to writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
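A minimal sketch of such a per-document script, assuming the collection is db.data and the field names from the example above:

db.data.find({date: {$exists: false}}).forEach(function(doc) {
    db.data.update(
        {_id: doc._id},
        {$set: {date: new Date(doc.stringdate)}}
    );
});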
The problem is that it takes a very long time. I have already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to make sure that the limiting factor is write-lock availability and not some other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interface, which confirmed that the write lock is held 70% of the time. I am a bit disappointed that MongoDB does not have finer granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection into a dozen shards, even while staying on the same server, because there would have been an individual lock on each shard.
But since I can't do anything about the current database structure right now, I looked into how to improve performance so that I spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call afterwards, and I don't want it: there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the end of the big process to retrieve the few exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behaviour, but not the procedure for changing it. Any suggestions?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want, but it won't help.
This is because, for one:
writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate).
When using the shell in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I have already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people when they asked about this stuff to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces which confirmed that the write lock is held 70% of the time
That's because of all the movement in your documents; that's kind of hard to fix.
I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually reflect the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
For one, write locks in MongoDB are mutexes.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

Excel List-Object VBA Performance Bug?

I have an issue with performance in an Excel application which uses List Objects (AKA Excel Tables). I suspect it may be a bug, but despite my Googling I could not find any reference to it. I've already developed a workaround for my application, but what I'm interested in is whether anyone can give any insight into why this happens.
Note: I’m using Excel 2007 on Windows Vista. The setup is as follows: I have a spreadsheet which holds data in a List Object, with VBA code which can be kicked off via a command button; this code may make several edits to any number of cells on the worksheet, so Excel’s Calculation mode is set to Manual prior to any edits.
The problem I've encountered is that if the currently active cell is within the List Object, then setting the calculation mode to Manual seems to have no effect whatsoever. So if a user happens to have a heavy-calculation workbook open in the same instance, then the VBA code runs very slowly. I practically had to pull my application apart to discover that this was caused by the active cell, and I created a new workbook with a simple version of this scenario to confirm that there wasn't some sort of corruption in my application.
I’ve been doing a number of test cases with this, and below are the results from what I’ve found:
Although it seems generally related to the calculation, there is still a time difference when the calculation mode is switched between Manual and Automatic...
Manual = 7.64 secs
Automatic = 9.39 secs
Manual mode is just under 20% faster than Automatic. But my expectation was that they'd be more or less the same, considering the issue seems to be the calculation kicking off even when in Manual mode.
Compare that to when the active cell is not on a List Object, and the results are vastly different...
Manual = 0.14 secs
Automatic = 3.23 secs
Now, the Manual run is 50 times faster, and the Automatic run shows that the calculation shouldn't have taken any more than 3.2 secs! So now the first test looks like it might have run the calculation twice while in Manual mode, and nearly three times while in Automatic mode.
Repeating the test again, this time in an instance with no calculation formulas in any cells, suddenly it doesn't seem as bad:
Active cell is List Object & Calc is Manual = 0.17 secs
Active cell is List Object & Calc is Automatic = 0.20 secs
Active cell is Empty & Calc is Manual = 0.14 secs
Active cell is Empty & Calc is Automatic = 0.18 secs
It’s still slower, but now it’s only by 10-20%, making it unnoticeable. But this does show that the issue must be related to the Calculation in some way, as otherwise it should have taken just as long as the first test.
If anyone wants to create these tests to see for themselves, the setup is as follows:
New workbook with a List Object added (it doesn't have to be linked to any data)
Add some formulas that will take Excel a while to calculate (I just did '=1*1' repeated 30,000 times)
Write some quick VBA code which will (i) loop through a simple edit of a cell several hundred times, and (ii) record the time it took (see the sketch after this list)
Then just run the code while changing the active cell between the List Object and an empty cell
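A minimal sketch of such a test macro; the sub name, loop count and cell address are my own assumptions, not from the post:

Sub TimeEdits()
    Dim i As Long
    Dim t As Double

    Application.Calculation = xlCalculationManual   ' or xlCalculationAutomatic
    t = Timer
    For i = 1 To 500                                ' simple edit, several hundred times
        Sheet1.Range("A1").Value = i
    Next i
    Debug.Print "Elapsed: " & Format(Timer - t, "0.00") & " secs"
    Application.Calculation = xlCalculationAutomatic
End Sub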
I'd be very interested to hear if anyone can explain why Excel behaves in this way, and whether it is a bug or some feature of List Objects which actually has some genuine use.
Thanks,
Stuart
This is not related to the "bug" you found, which is quite interesting and intriguing.
I just want to share that there is a great way to avoid calculation delays. I had fantastic results with this and now I use it all the time.
Simply put, Excel takes a long time copying data back and forth between the "VBA world" and the "spreadsheet world".
If you do all the "reads" at once, process them, and then do all the "writes" at once, you get amazing performance. This is done using variant arrays, as documented here:
http://msdn.microsoft.com/en-us/library/ff726673.aspx#xlFasterVBA
in the section labeled: Read and Write Large Blocks of Data in a Single Operation
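A minimal sketch of that pattern; the sub name and range address are illustrative assumptions, and it presumes the range holds numeric data:

Sub BulkUpdate()
    Dim data As Variant
    Dim r As Long, c As Long

    data = Sheet1.Range("A1:D10000").Value          ' one read into a variant array
    For r = 1 To UBound(data, 1)
        For c = 1 To UBound(data, 2)
            data(r, c) = data(r, c) * 2             ' process entirely in memory
        Next c
    Next r
    Sheet1.Range("A1:D10000").Value = data          ' one write back to the sheet
End Sub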
I was able to refactor some code I had that took 5 minutes to run and bring it down to 1.5 minutes. The refactoring took me 10 minutes, which is amazing because it was quite complex code.
Regarding Table performance (and performance, in general):
I know this is an old question, but I want to get this documented.
One thing that changed between older versions of Excel and the post-2007 versions is that Excel now activates the target sheet of any PasteSpecial operation. You cannot override this by turning off ScreenUpdating and making calculations manual. Such activation WILL make the sheet visible and cause uncontrollable flicker.
My original VBA code ran very fast on an old, single-processor XP box running Excel 2000. The change to Excel 2013 on a modern machine was stunning in how slowly the code executed. The three areas that kill performance are PasteSpecial from one sheet to another, any other code that requires activating sheets (zoom level, Advanced Filter, sheet-level range names, etc.), and automating sheet protection/unprotection.
This is too bad, because PasteSpecial helped "cleanse" data you copy (Direct use of .Copy to a target will throw the occasional error).
So you need to review your code and make sure you are using direct assignment to the right property for the data type you need (from among Value, Value2, Text, and Formula, for example), instead of PasteSpecial.
e.g. .Range("MYRANGE").Value = .Cells(5, 7).Value2
You also need to be scrupulous in resisting use of Select and Activate throughout your code.
As referenced above, many comments you'll find in Excel fora about that last point claim that you "never" need to use activation, which is clearly untrue, since several things in Excel only apply to or require active sheets. Understanding the cases where activation is forced automatically by a particular method or use of an object will help with coding as well. Unfortunately, you won't see much in the way of documentation of this.
Update:
Regarding Conditional Formatting, you'll find many complaints in various fora about the slowness of Excel when encountering a large number of Conditionally-formatted cells. I suspected this would impact Excel Tables since they have many table format options. To test this, I took a large workbook we use that is currently formatted as several worksheets with the same style of Excel Table on them.
After converting the tables to a conventional range, I noticed no difference in the speed of code execution. This would seem to indicate that using Excel Table formats is far superior to conditionally formatting your own arrays of cells.

Efficient server-side autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think a badly implemented autocomplete can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side) and which preferably fit into RAM (for best performance). So no browser-side JavaScript autocomplete (YUI/jQuery/Dojo); I think there are enough topics about that on Stack Overflow. But I could not find a good thread about this on Stack Overflow (maybe I did not look hard enough).
For example, autocompleting names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL LIKE, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (for what it could handle) would be cool.
Using something like Solr's terms component, for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this, so users with big sites, please tell me.
Maybe something like an in-memory Redis trie, which I also haven't tested the performance of.
I also read in this thread about how to implement this in Java (Lucene and some library created by Shilad).
What I would like to hear about are implementations used by real sites and numbers for how well they can handle load, preferably with:
A link to the implementation or code.
Numbers for the scale you know it can handle.
It would be nice if it could be accessed over HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use the indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work as ever-decreasing subsets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire result set for query 1, then there is no need to hit the database again for the following result sets, as they are a subset of your original query.
Depending on your data, this may open a further opportunity for optimisation.
I won't fully comply with your requirements, and obviously the scaling numbers will depend on hardware, the size of the DB, the architecture of the app, and several other factors. You must test it yourself.
But I will tell you the method I've used with success:
Use a simple SQL LIKE, for example SELECT name FROM users WHERE name LIKE 'al%', but use TOP 100 to limit the number of results (see the sketch after this list).
Cache the results and maintain a list of terms that are cached.
When a new request comes in, first check in the list whether you have the term (or part of the term) cached.
Keep in mind that your cached results are limited, so you may need to do a SQL query if the term is still valid at the end of the cached result set (I mean valid if the last cached result still matches the term).
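A minimal sketch of that capped query (SQL Server TOP syntax assumed; on MySQL or PostgreSQL you would use LIMIT 100 instead):

SELECT TOP 100 name
FROM users
WHERE name LIKE 'al%'
ORDER BY name;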
Hope it helps.
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way: by building an index and then making simple calls against it.
What I would want to know is what you are trying to auto-complete.
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale it by replicating data. Trying to cache calls or predict results just makes things complicated, and doesn't get to the root of the problem (i.e. you can only take that so far, as when requests miss the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving CaseInsensitive off and just converting the case.
IMHO, the best approach is to ensure data quality at Post time. Reasoning:
You (usually) know the nature of the data. So, e.g., you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase, which a general component like TdxMemDataSet is forced to use (see the sketch after this list).
You enter the data only once. Searching/sorting/filtering, which all imply the internal upper-casing engine of TdxMemDataSet, are repeated actions. Also, there are other chained actions which will trigger this engine without you realizing it (e.g. a TcxGrid which is sorted by default, has GridMode := True (I assume that you use the DevExpress components), and has a class acting like a broker passing the sort message to the underlying dataset).
Usually data entry is done in steps, one or a few records in a batch. The only notable exception is data acquisition applications. But in both cases the user's expectations allow far greater response times for you to play with. (IOW, how much would an UpperCase call add to a record post which lasts 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering etc.). Keep data retrieval as fast as you can.
Having the data in the database already in the right form reduces the risk of processing errors when you write (if you write) other modules (you would need to remember to AnsiUpperCase the data in any module in any language you write). A classical example here is when you use other external tools to access the data (for example, a DB manager executing an SQL SELECT over the data).
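A minimal sketch of up-casing at post time in a BeforePost handler; the form, dataset, handler and field names are assumptions, not from the question:

procedure TForm1.dxMemData1BeforePost(DataSet: TDataSet);
begin
  // GUIDs are plain ASCII, so UpperCase is enough (no need for AnsiUpperCase)
  DataSet.FieldByName('GUID_FIELD').AsString :=
    UpperCase(DataSet.FieldByName('GUID_FIELD').AsString);
end;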
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer to that performance question.
Anyway, it is better to guarantee that the data is in the format you want, for the reasons plainth already explained, at the moment you save it. So, in this specific case, make sure the GUID is written in upper (or lower, it's a matter of taste) case. If it is SQL Server or another database server that has a GUID datatype, make sure the SELECT does the work; if applicable and possible, even the sort.
