Improving NHibernate performance with too many objects in session

Our app was originally built with NHibernate and its limitations of batch processing in mind. However, over time it has transformed into a data cruncher and we are observing a significant performance decay.
The session ends up having to maintain about 1000 objects or more, and our profiling has revealed that auto flushing and dirty checking are the biggest offenders. We tried turning off auto flush and managing it ourselves on Save/Update operations, but that led to disastrous performance for a batch save/update.
We're now looking at the option of evicting unrequired objects from the session.
1. I came across the 2nd-level cache eviction method (sessionFactory.Evict(typeof(Cat));), which lets us evict by type, but we do not use a 2nd-level cache. Can I still use this method to evict objects from the 1st-level cache?
2. I also read about a pattern of fetching objects, evicting them from the session, and then reassociating them with the session, if needed, by calling Update() on them. Is this a recommended and accepted pattern? I ask because I also read that NH3 has put up a wall to this. (We can still use it, as we have not upgraded to NH3.)
While we realize that we are not using NHibernate in the best way, we are just looking to improve the current situation somehow. Answers to the above questions and any other suggestions/recommendations are greatly appreciated. Thanks.
Update
After looking at the NH documentation and code, I realize that 1 is probably not possible. I'm still looking for pointers or tips on using Evict(). I was able to drastically reduce the number of objects in a session, but I still do not know if there is a price to pay when updating or deleting evicted objects. Thanks in advance for your help.

It's hard to say without knowing more about your requirements but maybe you could use IStatelessSession. It doesn't have a 1st level cache to worry about.
Ayende has a good post on using it for bulk operations.

Why not use more sessions, instead of one large one? That, in conjunction with turning off autoflush has helped me in the past. Also, you should really think about using HQL for bulk updates if possible.

I know that this is old, but I came across this while looking for something else, having just solved the same problem. I solved it as Trent mentioned, by using more than one session. My case was iterating through a list, operating on each object, and trying to commit on each iteration. I would create one session to fetch all of the objects I wanted, then close that session. I would then run a foreach over my list, creating and disposing of a new session inside the loop and reattaching each object from the list to the new session. That took a process that was taking about 2.5 hours down to 2 minutes 40 seconds!
See this article for the inspiration to how I solved it -- although not exactly as I have unit of work wrappers around NHibernate:
http://weblogs.asp.net/ricardoperes/archive/2013/03/21/attaching-disconnected-entities-in-nhibernate-without-going-to-the-database.aspx


How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long Question for those willing to know the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
and some other fields, and there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could write an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resigned myself to writing a script that takes the documents one after the other and runs an update to set a new field which takes new Date(stringdate) as its value. The query uses the _id, so the default index is used.
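A minimal sketch of what such a per-document conversion could look like. The helper names parseStringDate and buildDateUpdate are hypothetical, and the sketch assumes the stored strings are in UTC, which the question does not state. The helpers are plain JavaScript; the commented loop at the end shows how they might be wired to the collection in the mongo shell.

```javascript
// Hypothetical helper: parse "YYYY-MM-DD HH:MM:SS" into a Date.
// ASSUMPTION: the stored strings are UTC. Returns null when the
// string does not match the expected pattern.
function parseStringDate(s) {
  var m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(s);
  if (!m) {
    return null;
  }
  // Date.UTC takes a zero-based month, hence the "- 1".
  return new Date(Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]));
}

// Hypothetical helper: build the filter and $set document for one record,
// or null when the record's stringdate cannot be parsed.
function buildDateUpdate(doc) {
  var date = parseStringDate(doc.stringdate);
  if (date === null) {
    return null;
  }
  return { filter: { _id: doc._id }, update: { $set: { date: date } } };
}

// In the mongo shell, the per-document loop might then look like this
// (left as a comment because it needs a live database):
//
//   db.data.find({ date: { $exists: false } }).forEach(function (doc) {
//     var op = buildDateUpdate(doc);
//     if (op) { db.data.update(op.filter, op.update); }
//   });
```

Filtering on documents that do not yet have a date field makes the script restartable; documents whose string fails to parse are skipped rather than overwritten.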
The problem is that it takes a very long time. I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is the database lock availability and not any other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have a more precise granularity on its write lock: why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection on a dozen shards even while staying on the same server, because there would have been individual locks on each shard.
But since I can't do anything about the current database structure right now, I looked for ways to improve performance so as to spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that, since I run my script in the default mongo shell, every update is followed by a getLastError() call. I don't want it, because there is a 99.99% chance of success, and even in case of failure I can still run an aggregation request after the big process ends to retrieve the few exceptions.
I don't think I would gain much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
> I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because for one:
> make a script that take the documents one after the other and make an update to set a new field which takes the new Date(stringdate) as its value.
When using the shell in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
> I already figured out that if only I had inserted empty date objects when I created the database I would now get better performance since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug but I do. That's an unfortunate side effect of being right when you were told you were wrong.
> mongostat and the web monitoring interfaces which confirmed that write lock is taken 70% of the time
That's because of all the movement in your documents, kinda hard to fix that.
> I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB, this is another common misconception that stems from the transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place and there is no easy way out. This does happen.

How to keep your distributed cache clean?

In a N-Tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available that allows you to mark your cache as dirty after you update the underlying persistence?
The difficulty I'm trying to wrap my head around is that cache entries are usually stored as key-value pairs, but a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that, for such and such a query, it needs to refetch from persistence?
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.
I have worked with AppFabric and I think I tried to do what you are asking about. I was working on an auction site and I wanted to pro-actively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.
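The two prongs above can be sketched with a small in-memory cache. This is only an illustration under stated assumptions: TaggedCache and its methods are hypothetical names, not the AppFabric API, and the clock is injectable purely to make the behaviour easy to check.

```javascript
// Hypothetical in-memory cache; NOT the AppFabric API.
// Prong 1: entries expire on their own via a TTL.
// Prong 2: entries carry tags so a whole group can be invalidated at once.
class TaggedCache {
  constructor(now = () => Date.now()) {
    this.now = now;              // injectable clock (assumption, for testing)
    this.entries = new Map();    // key -> { value, expiresAt, tags }
    this.keysByTag = new Map();  // tag -> Set of keys carrying that tag
  }

  set(key, value, { ttlMs = Infinity, tags = [] } = {}) {
    this.remove(key); // clear tag bookkeeping of any previous entry
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs, tags });
    for (const tag of tags) {
      if (!this.keysByTag.has(tag)) this.keysByTag.set(tag, new Set());
      this.keysByTag.get(tag).add(key);
    }
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.remove(key);
      return undefined;
    }
    return entry.value;
  }

  remove(key) {
    const entry = this.entries.get(key);
    if (!entry) return;
    for (const tag of entry.tags) {
      const keys = this.keysByTag.get(tag);
      if (keys) {
        keys.delete(key);
        if (keys.size === 0) this.keysByTag.delete(tag);
      }
    }
    this.entries.delete(key);
  }

  // Removes every entry carrying the tag; returns how many were removed.
  invalidateTag(tag) {
    const keys = this.keysByTag.get(tag);
    if (!keys) return 0;
    const doomed = [...keys];
    for (const key of doomed) this.remove(key);
    return doomed.length;
  }
}
```

In this sketch, listing summaries would be stored with a ttlMs and simply left to expire, while something like the bids for a listing would be stored once under a tag such as "listing:42" and dropped with invalidateTag when a new bid arrives.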

Write heavy dml-operations in MongoDB

I'm running MongoDB (2.2) on Linux, and I have a few questions.
I have a schema with many fields and sub-fields, and one index for these fields.
1. How fast are updates/deletes done on the index? I have about 3 updates/deletes etc. a second.
2. Is there a rule, like after 10,000 updates you have to compact or rebuild the index?
3. Are changes in the fields immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
1. Indexes are updated at the time of insert/update/remove. As for performance, the best answer would be to just test it.
2. Not that I know of. If you need to do regular compaction or repair, you should have replication too (but you can have it on the same host if resources permit).
3. Yes (well, on the same DB connection; on others it might take a bit more time. But if you're having that problem, I'm not the right person to answer you anyway ;)
Having said that, I strongly suggest you take a look at some of the presentations at http://www.10gen.com/presentations. I'm sorry I can't point out the ones that were particularly interesting and usable; I suggest you browse and pick the ones that seem interesting to you.
Note that MongoDB does things VERY differently and has quite a few gotchas for the unprepared. It is however a great DB once you know how to use it.

Mahout's datamodel with GenericDataModel

I am playing around with Mahout's recommendation engines and am running into a problem using the GenericDataModel object. My question is: if I want to add some new users' data to the existing data model, is the only way to do it to reconstruct a new data model by reading all the data again?
Currently, our data is in the cache.
Yes, that's correct. It's effectively read-only for performance. The general idea is that you don't incorporate data model updates frequently, as it generally means rebuilding a lot of other pre-computed or cached computations.
You could hack it to expose an update method without too much trouble. Just be careful of thread-safety issues.

Core Data and threading

What are some of the obscure pitfalls of using Core Data and threads? I've read much of the documentation, and so far I've come across the following either in the docs or through painful experience:
- Use a new NSManagedObjectContext for each thread, but a single NSPersistentStoreCoordinator is enough for the whole app.
- Before sending an NSManagedObject's objectID back to the main thread (or any other thread), be sure the context has been saved (or at a minimum, it wasn't a newly-inserted-but-not-yet-saved object) - otherwise the objectID will actually be a temporary ID and not a persistent one.
- Use mergeChangesFromContextDidSaveNotification: to detect when a save happens in another thread and use that to merge those changes with the current thread's context.
Bonus question/observation: I was led to believe by the wording of some of the docs that mergeChangesFromContextDidSaveNotification: is something only needed by the main thread to merge changes into the "main" context from worker threads - but I don't think that's the case.
I set up my importer to create batches of data which are imported using a subclass of NSOperation that owns its own context. The operations are loaded into an NSOperationQueue that's set to allow the default number of concurrent operations, so it's possible for several import batches to be running at the same time. I would occasionally get very strange validation errors and exceptions (like trying to add nil to a relationship) and other failures that I had never seen when I did all the same stuff on the main thread. It occurred to me (and perhaps this should have been obvious) that maybe the context merging needed to be done for all contexts in every thread - not just the "main" one! I don't know why I didn't think of that before, but I think this helped. (It hasn't been tested well enough yet for me to feel sure, though.) In any case, is it true that you need to observe that notification for ALL import threads that may be working with the same datasets and adding/updating the same entities? If so, this is yet another pitfall bullet point, IMO, although I have yet to be certain that it'll work.
Given how many of these I've run into with Core Data in general (and not all of them just about multi-threading), I have to wonder how many more are lurking. Since multi-threading so often ends up with bugs that are difficult if not impossible to reproduce due to the timing issues, I figured I'd ask if anyone had other important things that I may be missing that I need to concern myself with.
There is an entire rather large bit of documentation devoted to the subject of Core Data and Threading.
It isn't clear from your set of issues what isn't covered by that documentation.
