NHibernate IsUpdateNecessary takes enormous amount of time - performance

My C# 3.5 application uses SQL Server 2008 R2, NHibernate and Castle ActiveRecord. The application imports emails into the database along with their attachments. Emails and attachments are saved in batches of 50, each batch in a new session and transaction scope, to make sure they are not kept in memory (a mailbox can contain 100K emails).
Initially, emails are saved very quickly. However, closer to 20K emails, performance degrades dramatically. Using dotTrace I got the following picture:
Obviously, when I save an attachment, NHibernate tries to see whether it really should save it, probably by comparing it with other attachments in the session. To do so, it compares them byte by byte, which takes almost 500 seconds (for the snapshot in the picture) and 600M enumerator operations.
All this looks crazy, especially since I know for sure that SaveAndFlush should save the attachment without any checks: it is new and must be saved.
However, I cannot figure out, how to instruct NHibernate to avoid this check (IsUpdateNecessary). Please advise.
P.S. I am not sure, but it may be that the degradation in performance closer to 20K is not connected with having older mails in memory: I noticed that in the mailbox I am working with, larger emails appear later than smaller ones, so the problem may lie only in the attachment comparison.
Update:
Looks like I need something like StatelessSessionScope, but there is no documentation on it, even on the CastleProject site! If I do something like
using (TransactionScope txScope = new TransactionScope())
using (StatelessSessionScope scope = new StatelessSessionScope())
{
    mail.Save();
}
it fails with an exception saying that Save is not supported by a stateless session. I am supposed to insert objects into the session, but I do not have any Session (only the SessionScope; StatelessSessionScope adds only a single OpenSession method on top of SessionScope, and it accepts strange parameters).

Maybe I missed it in that long text, but are you using a stateless session for importing data? Using one prevents a lot of checks and also bypasses the first-level cache, thus using minimal resources.
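For reference, stateless sessions expose Insert/Update/Delete rather than Save, so a plain-NHibernate import might look roughly like the sketch below. This assumes you can get hold of the underlying ISessionFactory (how you obtain it from ActiveRecord is not shown), and Mail/Attachment stand in for the question's entity classes:
using System.Collections.Generic;
using NHibernate;

public static class MailImporter
{
    public static void Import(ISessionFactory sessionFactory, IEnumerable<Mail> mails)
    {
        // A stateless session has no first-level cache and performs no dirty checking,
        // so inserts stay cheap no matter how many objects have been imported so far.
        using (IStatelessSession session = sessionFactory.OpenStatelessSession())
        using (ITransaction tx = session.BeginTransaction())
        {
            foreach (Mail mail in mails)
            {
                session.Insert(mail);

                // Stateless sessions do not cascade, so attachments are inserted explicitly
                // (the Attachments collection on Mail is assumed here).
                foreach (Attachment attachment in mail.Attachments)
                    session.Insert(attachment);
            }

            tx.Commit();
        }
    }
}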

Looks like I've found an easy solution: for my Attachment class, which caused the biggest performance penalty, I overrode the following method:
protected override int[] FindDirty(
    object id,
    System.Collections.IDictionary previousState,
    System.Collections.IDictionary currentState,
    NHibernate.Type.IType[] types)
{
    return new int[0];
}
Thus, the dirty check reports no dirty properties for attachments (the object is never considered dirty), and NHibernate no longer does that crazy per-byte comparison.


Batching stores transparently

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
    ProductRecord record = productRepository.findProductByX(...);
    if (record == null) {
        record = productRepository.create();
        record.setX(...);
        record.store();
    } else {
        // do something else
    }
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute them after successful method invocation, prior to commit. I have found several posts (an SO post, the jOOQ API, a jOOQ GitHub feature request) where this topic is implicitly mentioned, and one user group post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
@Batched(100) // batch size as parameter, potentially
@Transactional
public void createTransaction(CreationCommand creationCommand) {
    // all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/Transactional/Session scope) resource which can keep track of the current batch such that:
1. Prior to entering the method, an empty batch is created if the method is @Batched.
2. A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so,
3. Intercept the calls and add them to the current batch instead of executing them immediately.
However, step 3 would necessitate rewriting a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
    record.setX(...);
    // ...
    record.store();
});
to:
userObjects.forEach(userObject -> {
    dslContext.insertInto(...).values(userObject.getX(), ...).execute();
});
which would defeat the purpose of having this abstraction in the first place, since the second form can also be rewritten using DSLContext#batchStore or DSLContext#batchInsert. IMO however, batching and bulk insertion should not be up to the individual developer and should be able to be handled transparently at a higher level (e.g. by the framework).
I find the readability of the jOOQ API to be an amazing benefit of using it; however, it seems that it does not lend itself very well (as far as I can tell) to interception/extension for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changed flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?
jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
    // Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
    records.forEach(record -> {
        record.setX(...);
        // ...
        record.store();
    });
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch execute the last prepared statement and create a new one.
Calls to PreparedStatement.executeUpdate() and execute(), and replace those by calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.

Parse Server - Saving objects with many fields - Schema Validation takes too long (enforceFieldExists)

We're using ParseServer to migrate a CloudCode based application to Heroku.
Using versions:
parse#1.8.5
parse-server#2.2.16
We noticed (it's hard not to notice) that saving some objects is unreasonably slow. These objects are typically saved a few at a time - between 2 and 6 objects (using Parse.Object.saveAll, which fires a REST call to /1/batch).
Saving each of these objects now takes anything between 4 and 12 seconds. Digging into the Parse code, it was easy to see that schema validation is the cause.
SchemaController.validateObject() {
...
SchemaController.enforceFieldExists()
We are using triggers for simple validation, but as per the logic in RestWrite.js this causes schema validation to be executed twice - once before the trigger and once after.
The problem lies in that our collection has about 40 fields. SchemaController.enforceFieldExists() loads the entire schema twice while attempting to validate each field. Moreover, it always attempts to write to the schema document (again, for each field), usually only to fail because all fields are already listed in the schema.
This means we get an overhead of about 240 round trips to the database for each object (roughly 40 fields x (2 schema loads + 1 attempted schema write) x 2 validation passes), and we typically store up to 5 objects in each invocation. That adds up to over 1,000 round trips to the database, so we easily exceed the Heroku router timeout limit of 30 seconds.
My questions are:
Is there anything I can do to speed up this validation? (did not find documentation or settings for that)
Is there a fix for this redundant implementation planned or available anywhere?
Can I safely castrate enforceFieldExists() to do nothing, without anything else breaking on me, assuming we don't add fields often? What is this collection (_SCHEMA) used for, other than to draw the tables in the Dashboard UI?
I'm currently thinking about patching this function to do nothing with an npm postinstall script. Does that sound like a good approach?
Appreciate any help on this,
Ron
This is being fixed with this pull request: https://github.com/ParsePlatform/parse-server/pull/2286
Specifically, this line:
https://github.com/ParsePlatform/parse-server/pull/2286/files#diff-7d0dd667d7bdafd6ebee06cf70139fa0R555
It will skip trying to write the schema if the current field is already available.
This should be released soon.

How to update/migrate data when using CQRS and an EventStore?

So I'm currently diving into the CQRS architecture along with the EventStore "pattern".
It opens applications up to a new dimension of scalability and flexibility, as well as testability.
However I'm still stuck on how to properly handle data migration.
Here is a concrete use case:
Let's say I want to manage a blog with articles and comments.
On the write side I'm using MySQL, and on the read side ElasticSearch; now, every time I process a Command, I persist the data on the write side and dispatch an Event to persist the data on the read side.
Now let's say I have some sort of ViewModel called ArticleSummary which contains an id and a title.
I have a new feature request to include the article's tags in my ArticleSummary, so I would add some dictionary to my model to hold the tags.
Given that the tags already exist in my write layer, I would need to update or use a new "table" to properly use the newly included data.
I'm aware of the EventLog Replay strategy, which consists of replaying all the events to "update" all the ViewModels, but, seriously, is it viable when we have a billion rows?
Are there any proven strategies? Any feedback?
I'm aware of the EventLog Replay strategy, which consists of replaying all the events to "update" all the ViewModels, but, seriously, is it viable when we have a billion rows?
I would say "yes" :)
You are going to write a handler for the new summary feature that would update your query side anyway. So you already have the code. Writing special once-off migration code may not buy you all that much. I would go with migration code when you have to do an initial update of, say, a new system that requires some data transformation once off, but in this case your infrastructure would exist.
You would need to send only the relevant events to the new handler so you also wouldn't replay everything.
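To make that concrete, here is a rough, hypothetical C# sketch (not tied to any particular event store's API - the event, projection and rebuilder types below are all illustrative): only the event type the new projection cares about is replayed into its handler, and everything else is skipped.
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event and projection types, just for illustration.
public interface IDomainEvent { }

public class ArticleTagged : IDomainEvent
{
    public Guid ArticleId { get; set; }
    public string Tag { get; set; }
}

public class ArticleSummaryTagProjection
{
    public void Handle(ArticleTagged e)
    {
        // Update the ArticleSummary document in the read store (e.g. ElasticSearch)
        // with the new tag for e.ArticleId.
    }
}

public class ProjectionRebuilder
{
    public void Rebuild(IEnumerable<IDomainEvent> eventStream, ArticleSummaryTagProjection projection)
    {
        // Only the relevant events reach the new handler, so the whole store
        // does not have to be replayed through every other denormalizer.
        foreach (var e in eventStream.OfType<ArticleTagged>())
            projection.Handle(e);
    }
}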
In any event, if you have a billion rows of data your servers would probably be able to handle the load :)
I'm currently using the NEventStore by JOliver.
When we started, we were replaying our entire store back through our denormalizers/event handlers when the application started up.
We were initially keeping all our data in memory but knew this approach wouldn't be viable in the long term.
The approach we use currently is that we can replay an individual denormalizer, which makes things a lot faster since you aren't unnecessarily replaying events through denormalizers that haven't changed.
The trick we found, though, was that we needed another representation of our commits so we could query all of the events by event type - a query that cannot be performed against the normal store.

Storing, Loading, and Updating a Trie in ASP.NET MVC 3

I have a trie-based word detection algorithm for a custom dictionary. Note that regular expressions are too brittle with this dictionary as entries may contain spaces, periods, etc.
I've implemented the algorithm in a local C# app that reads in the dictionary from file and stores the trie in memory (it's compact, so no RAM size issues at all). Now I would like to use this algorithm in an MVC 3 app on a cloud host like AppHarbor, with the added twist that I want a web interface to enable adding/editing words.
It's fast enough that loading the dictionary from file and building the trie every time a user uploads their text would not be an issue (< 1s on my laptop). However, if I want to enable admins to edit the dictionary via the web interface, that would seem tricky since the dictionary would potentially be getting updated while a user is trying to upload text for analysis.
What is the best strategy for storing, loading, and updating the trie in an MVC 3 app?
I'm not sure if you are looking for specific implementation details or more conceptual ideas about how to handle it, but I'll throw some ideas out there for now.
Actual Trie Classes - Here is a good C# example of classes for setting up a Trie. It sounds like you already have this part figured out.
Storing: I would persist the trie data to XML unless you are already using a database and have some need to have it in a DBMS. The XML will be simple to work with in the MVC application and you don't need to worry about database connectivity issues or the added cost of a database. I would also keep two versions of the trie data on the server: a production copy and a production-support copy, the latter being the one your admin performs transactions against.
Loading: In your admin module of the application, you may implement a feature for loading the trie data into memory; the frequency of data loading depends on your application needs. It could be scheduled or available as a manual function. As on WordPress sites, if a user should access it while it is updating, they would receive a message that the site is undergoing maintenance. You may choose to load into memory on demand only, and keep the trie loaded at all times except when problems occur.
Updating: I'd have a second database (or XML file) that is used for applying updates. The method of applying updates to production would depend partially on the frequency, quantity, and timing of updates. One safe method might be to store the transactions entered by the admin.
For example:
trie.put("John", 112);
trie.put("Doe", 222);
trie.Remove("John");
Then apply these transactions to your production data as needed via an admin function. If needed, put your site into "maintenance" mode. If the updates are few and fast, you may be able to code the site so that it will hold all work until the transactions are processed; a user might have to wait a few milliseconds longer for a result, but you wouldn't have to worry about data-mutation issues.
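As a rough illustration of that transaction-log idea (the Trie class, its Put/Remove methods and the operation shape below are all hypothetical, mirroring the example above):
using System.Collections.Generic;

public enum TrieOperationKind { Put, Remove }

public class TrieOperation
{
    public TrieOperationKind Kind { get; set; }
    public string Word { get; set; }
    public int Value { get; set; }
}

public class TrieDeployer
{
    private readonly object _sync = new object();

    // Applies the transactions the admin entered against the production trie.
    public void Apply(Trie productionTrie, IEnumerable<TrieOperation> pendingOperations)
    {
        lock (_sync) // hold incoming work briefly while the transactions are processed
        {
            foreach (var op in pendingOperations)
            {
                if (op.Kind == TrieOperationKind.Put)
                    productionTrie.Put(op.Word, op.Value);
                else
                    productionTrie.Remove(op.Word);
            }
        }
    }
}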
This is pretty vague but just throwing some ideas out there... if you provide comments I'll try to give more.
1 Store the trie in the cache:
It is not very dynamic data, and caching helps with other tasks (like concurrent access to the trie by the admin and users).
2 Make access to the cache clear:
public class TrieHelper
{
    public Trie MyTrie
    {
        get
        {
            if (HttpContext.Current.Cache["myTrieKey"] == null)
                HttpContext.Current.Cache["myTrieKey"] = LoadTrieFromFile(); // returns a Trie object
            return (Trie)HttpContext.Current.Cache["myTrieKey"];
        }
    }
3 Lock the trie object while an add operation is in progress:
    public void AddWordToTrie(string word)
    {
        var trie = MyTrie;
        lock (HttpContext.Current.Cache["myTrieKey"])
        {
            trie.AddWord(word);
        } // note: locking the trie object while writing the data to the file is not required
        WriteNewWordToTrieFile(word); // should lock the FileWriter object
    }
}
4 If editing is performed by one admin at a time, store the trie in an XML file - it will be easy to implement the logic of finding the element after which your word should be added (you can create a function that uses the MyTrie object in memory), and add it using LINQ to XML.
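For point 4, a small sketch of adding a word with LINQ to XML, assuming a hypothetical <words><word value="..." weight="..."/></words> file layout:
using System.Linq;
using System.Xml.Linq;

public static class TrieXmlStore
{
    public static void AddWord(string path, string word, int weight)
    {
        XDocument doc = XDocument.Load(path);
        XElement words = doc.Root; // the <words> element

        // Find the last entry that sorts before the new word, so the file stays ordered.
        XElement insertAfter = words.Elements("word")
            .LastOrDefault(e => string.CompareOrdinal((string)e.Attribute("value"), word) < 0);

        XElement newWord = new XElement("word",
            new XAttribute("value", word),
            new XAttribute("weight", weight));

        if (insertAfter != null)
            insertAfter.AddAfterSelf(newWord);
        else
            words.AddFirst(newWord);

        doc.Save(path);
    }
}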
I've got kind of the same thing, but 10 times bigger :)
The client designs its own calendar with questions and possible answers, while in the meantime some of it is online and being used by normal users.
What I came up with was something like test and deploy. The Admin enters the calendar values and sets everything up correctly, and can then use a Preview button to see if it's what he needs/wants; then, to make the changes valid for all end users, he needs to push Deploy.
He, as an ADMIN, will know that, until he pushes the DEPLOY button, all users accessing the calendar will see the old values. As soon as he hits Deploy, everything is set in the database and the files he uploaded are pushed to Amazon S3 (for faster access).
I update the cache with the new calendar, and the new Calendar object stays cached until the app pool says otherwise or he hits the Deploy button again.
You could do something like this.
As you are going to run your application in a cloud environment, I'd suggest you take a look at CQRS and durable messaging, and provide some concurrency model (possibly optimistic concurrency and intelligent conflict detection; see http://skillsmatter.com/podcast/design-architecture/cqrs-not-just-for-server-systems at 5:00).
Also, obviously, you need to analyze your business requirements more precisely because, as Udi Dahan mentioned, race conditions are a result of a lack of business analysis.

Core Data is using a lot of memory

I have a data model which is sort of like this simplified drawing:
(Data model diagram: http://dl.dropbox.com/u/545670/thedatamodel.png)
It's a little weird, but the idea is that the app manages multiple accounts/identities a person may have into a single messaging system. Each account is associated with one user on the system, and each message could potentially be seen/sent-to multiple accounts (but they have a globally unique ID hence the messageID property which is used on import to fetch message objects that may have already been downloaded and imported by a prior session).
The app is used from a per-account point of view - what I mean is that you choose which account you want to use, then you see the messages and stuff from that account's point of view in your window. So I have the messages attached to the account so that I can easily get the messages that should be shown using a fetch like this:
fetch.fetchPredicate = [NSPredicate predicateWithFormat:@"%@ IN accounts", theAccount];
fetch.sortDescriptors = [NSArray arrayWithObject:[[NSSortDescriptor alloc] initWithKey:@"date" ascending:NO]];
fetch.fetchLimit = 20;
This seems like the right way to set this up in that the messages are shared between accounts and if a message is marked as read by one, I want it seen as being read by the other and so on.
Anyway, after all this setup, the big problem is that memory usage seems to get a little crazy. When I setup a test case where it's importing hundreds of messages into the system, and periodically re-fetching (using the fetch mentioned above) and showing them in a list (only the last 20 are referenced by the list), memory just gets crazy. 60MB.. 70MB... 100MB.. etc.
I tracked it down to the many-to-many relation between Account and Message. Even with garbage collection on, the managed objects are still being referenced strongly by the account's messages relationship property. I know this because I put a log in the finalize of my Message instance and never see it - but if I periodically reset the context or do refreshObject:mergeChanges: on the account object, I see the finalize messages and memory usage stays pretty consistent (although still growing somewhat, but considering I'm importing stuff, that's to be expected). The problem is that I can't really reset the context or the account object all the time because that really messes up observers that are observing other attributes of the account object!
I might just be modeling this wrong or thinking about it wrong, but I keep reading over and over that it's important to think of Core Data as an object graph and not a database. I think I've done that here, but it seems to be causing trouble. What should I do?
Use the Object Graph instrument. It'll tell you all of the ownerships keeping an object alive.
Have you read the section of the docs on this topic?
