Grails hibernate session in batches - performance

GORM works fine out of the box as long as no batch exceeds roughly 10,000 objects. Without optimisation you will run into OutOfMemory errors.
The common solution is to flush() and clear() the session every n objects (e.g. n = 500):
Session session = sessionFactory.currentSession
Transaction tx = session.beginTransaction()
def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP
Date yesterday = new Date() - 1
Criteria c = session.createCriteria(Foo.class)
c.add(Restrictions.lt('lastUpdated', yesterday))
ScrollableResults rawObjects = c.scroll(ScrollMode.FORWARD_ONLY)
int batchSize = 500
int count = 0
while (rawObjects.next()) {
    def rawObject = rawObjects.get(0)
    fooService.doSomething()
    if (++count % batchSize == 0) {
        // flush a batch of updates and release memory:
        try {
            session.flush()
        } catch (Exception e) {
            log.error(session)
            log.error(" error: " + e.message)
            throw e
        }
        session.clear()
        propertyInstanceMap.get().clear()
    }
}
session.flush()
session.clear()
tx.commit()
But there are some problems I can't solve:
If I use currentSession, the controller fails because the session is empty.
If I use sessionFactory.openSession(), the currentSession is still used inside FooService. Of course I can use the session.save(object) notation. But this means that I have to modify fooService.doSomething() and duplicate code for the single operation (common Grails notation like fooObject.save()) and the batch operation (session.save(fooObject) notation).
If I use Foo.withSession{session->} or Foo.withNewSession{session->}, the objects of the Foo class are cleared by session.clear() as expected. All other objects are not cleared, which leads to a memory leak.
Of course I can use evict(object) to clear the session manually. But it is nearly impossible to get all relevant objects, due to auto-fetching of associations.
So I have no idea how to solve my problems without making FooService.doSomething() more complex. I'm looking for something like withSession{} for all domains, or a way to save the session at the beginning (Session tmp = currentSession) and later do something like sessionFactory.setCurrentSession(tmp). Neither exists!
Any idea is welcome!

I would recommend using a stateless session for this kind of batch processing. See this post: Using StatelessSession for Batch processing

A modified approach to what you are doing would be:
Loop over your entire collection (rawObjects) and save a list of all the ids for those objects.
Loop over the list of ids. At each iteration, look up just that single object, by its id.
Then use the same periodic clearing of the session cache like you are doing now.
By the way, someone else has suggested an approach similar to yours. But note that the code in this link is incorrect; the lines that clear the session should be inside the if statement, just like you have in your solution.
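The id-list approach above can be sketched in a few lines. This is only an illustration of the control flow, not Grails or Hibernate code: it uses Python's built-in sqlite3 module, the table `foo` and its `value` column are made up for the example, and `conn.commit()` merely stands in for the periodic `session.flush()` plus `session.clear()`.

```python
import sqlite3


def process_in_batches(conn, batch_size=500):
    """Two-pass batch update: collect ids first, then load one row at a time."""
    # Pass 1: keep only the primary keys in memory, which is cheap.
    ids = [row[0] for row in conn.execute("SELECT id FROM foo ORDER BY id")]

    processed = 0
    for foo_id in ids:
        # Pass 2: look up just this single row by its id and work on it.
        (value,) = conn.execute(
            "SELECT value FROM foo WHERE id = ?", (foo_id,)
        ).fetchone()
        conn.execute(
            "UPDATE foo SET value = ? WHERE id = ?", (value + 1, foo_id)
        )
        processed += 1
        if processed % batch_size == 0:
            # Stand-in for session.flush() + session.clear() (assumption).
            conn.commit()

    conn.commit()  # write out the final partial batch
    return processed


# Hypothetical demo data: 1200 rows with values 0..1199.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO foo (value) VALUES (?)", [(i,) for i in range(1200)])
conn.commit()

total = process_in_batches(conn, batch_size=500)
```

Because only ids are held across iterations, the memory needed between batches does not depend on how many associated objects each row drags in.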

Related

Select Count very slow using EF with Oracle

I'm using EF 5 with Oracle database.
I'm doing a select count on a table with a specific parameter. When I use EF, the query returns the value 31, as expected, but the result takes about 10 seconds to come back.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
    var teste = (from ens in serv.PA_ENSAIOS_UM
                 where ens.COD_IDENT_UNMET == "FBLDY3840"
                 select ens).Count();
}
If I execute the simple query below, the result is the same (31), but it is returned in about 500 milliseconds.
SELECT
    count(*)
FROM
    PA_ENSAIOS_UM
WHERE
    COD_IDENT_UNMET = 'FBLDY3840'
Is there a way to improve the performance when I'm using EF?
Note: There are 13,000,000 rows in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you will instantiate your DbContext (let's call it "_context") and then set the Database.Log property to be the logging method. It's fine if this method doesn't actually do anything--you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses nLog to write to files)
public static void LogQuery(string queryData)
{
    if (string.IsNullOrWhiteSpace(queryData))
        return;
    var message = string.Format("{0}{1}",
        queryData.Trim().Contains(Environment.NewLine) ?
            Environment.NewLine : "", queryData);
    _sqlLogger.Info(message);
    _genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then when you create your context point to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that often the first run is the slowest because the server has to actually do the work, but on the subsequent runs, it often uses cached data. Try running your tests 2-3 times back to back and see if they don't start to run in the same time.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may provide better time)
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, esp. since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL or not.
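The concern in the last paragraph is the difference between counting on the server and counting on the client. Whether EF really does the latter here is exactly what capturing the generated SQL would settle. The sketch below only illustrates the two shapes of work, using Python's built-in sqlite3 as a stand-in database; the table contents are invented for the example.

```python
import sqlite3

# Hypothetical stand-in table: 31 matching rows among 1031.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pa_ensaios_um (cod_ident_unmet TEXT)")
rows = [("FBLDY3840",)] * 31 + [("OTHER",)] * 1000
conn.executemany("INSERT INTO pa_ensaios_um VALUES (?)", rows)

code = "FBLDY3840"

# Server-side count: the database does the work and returns one number.
(server_count,) = conn.execute(
    "SELECT count(*) FROM pa_ensaios_um WHERE cod_ident_unmet = ?", (code,)
).fetchone()

# Client-side count: every matching row is transferred, then counted locally.
client_count = len(
    conn.execute(
        "SELECT * FROM pa_ensaios_um WHERE cod_ident_unmet = ?", (code,)
    ).fetchall()
)
```

Both give the same answer; the difference is how much data crosses the wire, which is negligible for 31 rows but not for a badly filtered query on a 13-million-row table.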

Using "Any" or "Contains" when context not saved yet

Why isn't the exception triggered? Does LINQ's "Any()" not consider the new entries?
MyContext db = new MyContext();
foreach (string email in new[] {"asdf#gmail.com", "asdf#gmail.com"})
{
    Person person = new Person();
    person.Email = email;
    if (db.Persons.Any(p => p.Email.Equals(email)))
    {
        throw new Exception("Email already used!");
    }
    db.Persons.Add(person);
}
db.SaveChanges();
Shouldn't the exception be triggered on the second iteration?
The previous code is adapted for the question, but the real scenario is the following:
I receive an Excel file of persons and iterate over it, adding every row as a person to db.Persons and checking that their emails aren't already used in the db. The problem arises when there are repeated emails in the worksheet itself (two rows with the same email).
Yes - queries (by design) are only computed against the data source. If you want to query in-memory items you can also query the Local store:
if (db.Persons.Any(p => p.Email.Equals(email)) ||
    db.Persons.Local.Any(p => p.Email.Equals(email)))
However - since YOU are in control of what's added to the store wouldn't it make sense to check for duplicates in your code instead of in EF? Or is this just a contrived example?
Also, throwing an exception for an already existing item seems like a poor design as well - exceptions can be expensive, and if the client does not know to catch them (and in this case compare the message of the exception) they can cause the entire program to terminate unexpectedly.
A call to db.Persons will always trigger a database query, but those new Persons are not yet persisted to the database.
I imagine if you look at the data in debug, you'll see that the new person isn't there on the second iteration. If you were to set MyContext db = new MyContext() again, it would be, but you wouldn't do that in a real situation.
What is the actual use case you need to solve? This example doesn't seem like it would happen in a real situation.
If you're comparing against the db, your code should work. If you need to prevent dups being entered, it should happen elsewhere - on the client or checking the C# collection before you start writing it to the db.
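Checking the incoming collection for duplicates before anything is written could look like the sketch below. It is written in Python rather than C# and is not tied to EF; the function name `find_new_emails` and the sample addresses are invented for the example, and `existing` stands for whatever emails the database already holds.

```python
def find_new_emails(incoming, existing):
    """Split incoming emails into ones safe to insert and ones already seen.

    `existing` holds emails already stored; duplicates inside `incoming`
    itself are caught too, because each accepted email is remembered.
    """
    seen = set(existing)
    accepted, duplicates = [], []
    for email in incoming:
        if email in seen:
            duplicates.append(email)
        else:
            seen.add(email)
            accepted.append(email)
    return accepted, duplicates


# Hypothetical worksheet rows: the third repeats the first.
accepted, duplicates = find_new_emails(
    ["a@example.com", "b@example.com", "a@example.com"],
    existing={"c@example.com"},
)
```

Returning the duplicates instead of throwing on the first one also lets the caller report every bad row of the worksheet at once.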

SQLAlchemy - when I iterate on a query, do I get a list or an iterator?

I'm starting to learn how to use SQLAlchemy and I'm running into some efficiency problems.
I created an object mapping an existing big table on our Oracle database:
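# Imports for the classical mapping API used below (older SQLAlchemy releases)
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import mapper, sessionmaker

engine = create_engine(connectionString, echo=False)

class POI(object):
    def __repr__(self):
        return "{poi_id} - {title}, {city} - {uf}".format(**self.__dict__)

def loadSession():
    metadata = MetaData(engine)
    _poi = Table('tbl_ourpois', metadata, autoload=True)
    mapper(POI, _poi)
    Session = sessionmaker(bind=engine)
    session = Session()
    return session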
This table has millions of records. When I do a simple query and try to iterate over it:
session = loadSession()
for poi in session.query(POI):
    print poi
I noticed two things: (1) it takes some minutes for it to start printing objects on the screen, (2) memory usage starts to grow like crazy. So, my conclusion was that this code was fetching all the result set in a list and then iterating over it. Is this correct?
With cx_Oracle, when I do a query like:
conn = cx_Oracle.connect(connectionString)
cursor = conn.cursor()
cursor.execute("select * from tbl_ourpois")
for poi in cursor:
    print poi
the resulting cursor behaves as an iterator that fetches results into a buffer and returns them as they are needed instead of loading the whole thing into a list. This loop starts printing results almost instantly, and memory usage is low and constant.
Can I get this kind of behavior with SQLAlchemy? Is there a way to get a constant-memory iterator out of session.query(POI) instead of a list?
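The buffered, constant-memory iteration described for the cx_Oracle cursor is the generic DB-API pattern, and it can be made explicit with `fetchmany`. The sketch below uses Python's built-in sqlite3 as a stand-in for Oracle, with invented table contents and a hypothetical helper `iter_rows`. On the ORM side, `Query.yield_per()` is the SQLAlchemy feature aimed at the same goal, though whether the driver still buffers the full result set underneath depends on the DBAPI in use.

```python
import sqlite3


def iter_rows(cursor, chunk_size=1000):
    """Yield rows one at a time while fetching them in fixed-size chunks."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        for row in chunk:
            yield row


# Hypothetical stand-in for the large table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_ourpois (poi_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO tbl_ourpois (title) VALUES (?)",
    [("poi %d" % i,) for i in range(5000)],
)

cursor = conn.execute("SELECT poi_id, title FROM tbl_ourpois ORDER BY poi_id")
rows = iter_rows(cursor, chunk_size=100)
first = next(rows)                # available after fetching only one chunk
remaining = sum(1 for _ in rows)  # the rest streams through chunk by chunk
```

At no point does the loop hold more than `chunk_size` rows, which is what keeps memory flat.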

entity framework - does this do a dirty read?

I have a bit of linq to entities code in a web app. It basically keeps a count of how many times an app was downloaded. I'm worried that this might happen:
Session 1 reads the download count (eg. 50)
Session 2 reads the download count (again, 50)
Session 1 increments it and writes it to the db (database stores 51)
Session 2 increments it and writes it to the db (database stores 51)
This is my code:
private void IncreaseHitCountDB()
{
    JTF.JTFContainer jtfdb = new JTF.JTFContainer();
    var app =
        (from a in jtfdb.Apps
         where a.Name.Equals(this.Title)
         select a).FirstOrDefault();
    if (app == null)
    {
        app = new JTF.App();
        app.Name = this.Title;
        app.DownloadCount = 1;
        jtfdb.AddToApps(app);
    }
    else
    {
        app.DownloadCount = app.DownloadCount + 1;
    }
    jtfdb.SaveChanges();
}
Is it possible that this could happen? How could I prevent it?
Thank you,
Fidel
Entity Framework, by default, uses an optimistic concurrency model. Google says optimistic means "Hopeful and confident about the future", and that's exactly how Entity Framework acts. That is, when you call SaveChanges() it is "hopeful and confident" that no concurrency issue will occur, so it just tries to save your changes.
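Entity Framework can also be told to detect conflicting changes when it saves. Strictly speaking this is still optimistic concurrency, since no locks are taken, but with the check enabled a conflict is reported as an exception instead of being silently overwritten. You enable the check on an entity-by-entity basis. In your case, you would enable it on the App entity. This is what I do: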
Step 1. Enabling concurrency checking on an Entity
Right-click the .edmx file and choose Open With...
Choose XML (Text) Editor in the popup dialog, and click OK.
Locate the App entity in the ConceptualModels. I suggest toggling outlining and just expanding tags as necessary. You're looking for something like this:
<edmx:Edmx Version="2.0" xmlns:edmx="http://schemas.microsoft.com/ado/2008/10/edmx">
<!-- EF Runtime content -->
<edmx:Runtime>
<!-- SSDL content -->
...
<!-- CSDL content -->
<edmx:ConceptualModels>
<Schema Namespace="YourModel" Alias="Self" xmlns:annotation="http://schemas.microsoft.com/ado/2009/02/edm/annotation" xmlns="http://schemas.microsoft.com/ado/2008/09/edm">
<EntityType Name="App">
Under the EntityType you should see a bunch of <Property> tags. If one exists with Name="Status" modify it by adding ConcurrencyMode="Fixed". If the property doesn't exist, copy this one in:
<Property Name="Status" Type="Byte" Nullable="false" ConcurrencyMode="Fixed" />
Save the file and double click the .edmx file to go back to the designer view.
Step 2. Handling concurrency when calling SaveChanges()
SaveChanges() will throw one of two exceptions. The familiar UpdateException or an OptimisticConcurrencyException.
If you have made changes to an entity that has a property with ConcurrencyMode="Fixed", Entity Framework checks on save whether that property's value in the data store still matches the value it originally read. If it has changed, an OptimisticConcurrencyException is thrown. If not, the save continues normally.
When you catch the OptimisticConcurrencyException you need to call the Refresh() method of your ObjectContext and redo your calculation before trying again. The call to Refresh() updates the Entity(s) and RefreshMode.StoreWins means conflicts will be resolved using the data in the data store. The DownloadCount being changed concurrently is a conflict.
Here's what I'd make your code look like. Note that this is more useful when you have a lot of operations between getting your Entity and calling SaveChanges().
private void IncreaseHitCountDB()
{
    JTF.JTFContainer jtfdb = new JTF.JTFContainer();
    var app =
        (from a in jtfdb.Apps
         where a.Name.Equals(this.Title)
         select a).FirstOrDefault();
    if (app == null)
    {
        app = new JTF.App();
        app.Name = this.Title;
        app.DownloadCount = 1;
        jtfdb.AddToApps(app);
    }
    else
    {
        app.DownloadCount = app.DownloadCount + 1;
    }
    try
    {
        try
        {
            jtfdb.SaveChanges();
        }
        catch (OptimisticConcurrencyException)
        {
            jtfdb.Refresh(RefreshMode.StoreWins, app);
            app.DownloadCount = app.DownloadCount + 1;
            jtfdb.SaveChanges();
        }
    }
    catch (UpdateException uex)
    {
        // Something else went wrong...
    }
}
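The check-then-retry pattern above is not specific to Entity Framework. As a rough sketch in Python with the built-in sqlite3 module: the UPDATE only succeeds if the stored value is still the one that was read, and a failed check triggers a re-read, which plays the role of Refresh(StoreWins). The table layout, the function name `increment_with_check` and the `before_write` hook (used only to simulate a second session) are assumptions made for the example.

```python
import sqlite3


def increment_with_check(conn, name, max_retries=3, before_write=None):
    """Read-modify-write that detects a concurrent change and retries."""
    for attempt in range(max_retries):
        (count,) = conn.execute(
            "SELECT download_count FROM apps WHERE name = ?", (name,)
        ).fetchone()

        if before_write is not None and attempt == 0:
            before_write()  # test hook: pretend another session writes now

        # The extra "AND download_count = ?" is the concurrency check.
        cur = conn.execute(
            "UPDATE apps SET download_count = ? "
            "WHERE name = ? AND download_count = ?",
            (count + 1, name, count),
        )
        conn.commit()
        if cur.rowcount == 1:
            return count + 1
        # rowcount == 0: someone changed the row; loop to re-read and retry.
    raise RuntimeError("gave up after repeated conflicts")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (name TEXT PRIMARY KEY, download_count INTEGER)")
conn.execute("INSERT INTO apps VALUES ('demo', 50)")
conn.commit()


def other_session():
    # Simulated second session incrementing the same counter.
    conn.execute(
        "UPDATE apps SET download_count = download_count + 1 WHERE name = 'demo'"
    )
    conn.commit()


result = increment_with_check(conn, "demo", before_write=other_session)
```

Both increments survive: the simulated session takes the count from 50 to 51, the first attempt is rejected, and the retry moves it from 51 to 52.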
You can reduce the risk by querying the download count column only right before you increment it. The longer the gap between reading and incrementing, the more time another session has to read the same value and later write back a wrongly incremented number, corrupting the count.
Or do it with a single SQL query:
UPDATE Data SET Counter = (Counter+1)
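A single-statement increment never reads the value into the application, so there is nothing stale to write back. Here is a minimal sketch with Python's built-in sqlite3; the `apps` table and the function `record_download` are invented for the example, and a WHERE clause is added so that only one application's row changes.

```python
import sqlite3


def record_download(conn, name):
    """Increment the counter inside the database in one statement."""
    cur = conn.execute(
        "UPDATE apps SET download_count = download_count + 1 WHERE name = ?",
        (name,),
    )
    if cur.rowcount == 0:
        # First download of this app. Two sessions could both reach this
        # line at once; the PRIMARY KEY would then reject the second INSERT.
        conn.execute(
            "INSERT INTO apps (name, download_count) VALUES (?, 1)", (name,)
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (name TEXT PRIMARY KEY, download_count INTEGER)")

for _ in range(3):
    record_download(conn, "demo")

(count,) = conn.execute(
    "SELECT download_count FROM apps WHERE name = 'demo'"
).fetchone()
```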
Since it's LINQ to Entities, execution is deferred. For another session to corrupt the count (incrementing from the same base value and so losing one count), I believe it would have to update app.DownloadCount between these two lines:
else
{
app.DownloadCount += 1; //First line
}
jtfdb.SaveChanges(); //Second line
}
That means the window in which a change could make the previous count stale is so small that, for an application like this, a conflict is virtually impossible.
I'm no LINQ pro, so I don't know whether LINQ actually reads app.DownloadCount before adding one or just adds one through some SQL command, but in either case I don't think you need to worry about it.
You could easily test what would happen in this scenario - start a thread, sleep it, and then start another.
else
{
app.DownloadCount = app.DownloadCount + 1;
}
System.Threading.Thread.Sleep(10000);
jtfdb.SaveChanges();
But the simple answer is that no, Entity Framework does not perform any concurrency checking by default (MSDN - Saving Changes and Managing Concurrency).
That site will provide some background for you.
Your options are
to enable concurrency checking, which will mean that if two users download at the same time and the first updates after the second has read but before the second has updated, you'll get an exception.
create a stored procedure that will increment the value in the table directly, and call the stored procedure from code in a single operation - e.g. IncrementDownloadCounter. This will ensure that there is no 'read' and therefore no possibility of a 'dirty read'.

linq System.ObjectDisposedException

I have a problem with some data I retrieved from the db with LINQ.
When I try to access the data I get the following exception:
System.ObjectDisposedException : The instance of ObjectContext was deleted and is not possible to use it again for action that need a connection.
This is the code:
try
{
    using (ProvaDbEntities DBEntities =
        new ProvaDbEntities(Utilities.ToEntitiesConnectionString()))
    {
        ObjectQuery<site> sites = DBEntities.site;
        IEnumerable<site> q = from site in sites
                              select site;
        ObjectQuery<auction> auctions = DBEntities.auction;
        IEnumerable<auction> q1 = from auction in auctions
                                  where auction.site == this.Name
                                  select auction;
        IEnumerable<IAuction> res = q1.Cast<IAuction>();
        return res;
    }
}
catch (Exception e)
{
    throw new UnavailableDbException("[GetAuctions]" + e.Message);
}
Can someone help me?
Thanks,
Fabio
Yes - you're returning a result which will be lazily evaluated - but you're disposing of the data context which would be used to fetch the results.
Options:
Load the results eagerly, e.g. by calling ToList on the result
Don't dispose of the context (I don't know what the situation is in the Entity Framework; you could get away with this in LINQ to SQL, but it may not be a good idea in EF)
Dispose of the context when you're finished with the data
In this case I'd suggest using the first option - it'll be safe and simple. As you're already filtering the results and you're casting to IEnumerable<IAuction> anyway, you're unlikely to get the normal downsides of materializing the query early. (If it were still IQueryable<T>, you'd be throwing away the ability to add extra bits to the query and them still be translated to SQL.)
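The same trap exists outside .NET, which may make the mechanism easier to see. In this Python sketch a generator plays the role of the lazily evaluated query and a sqlite3 connection plays the role of the context; all names and the sample data are invented for the example.

```python
import sqlite3
from contextlib import closing


def _open():
    # Hypothetical stand-in database with three auctions on two sites.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE auction (title TEXT, site TEXT)")
    conn.executemany(
        "INSERT INTO auction VALUES (?, ?)",
        [("lamp", "x"), ("desk", "x"), ("sofa", "y")],
    )
    return conn


def _titles(conn, site):
    # Generator: the query does not run until the caller starts iterating.
    query = "SELECT title FROM auction WHERE site = ? ORDER BY title"
    for (title,) in conn.execute(query, (site,)):
        yield title


def get_auctions_lazy(site):
    with closing(_open()) as conn:
        return _titles(conn, site)  # connection is closed before any row is read


def get_auctions_eager(site):
    with closing(_open()) as conn:
        return list(_titles(conn, site))  # materialized while still open


eager = get_auctions_eager("x")

try:
    list(get_auctions_lazy("x"))
    lazy_error = None
except sqlite3.ProgrammingError as exc:
    lazy_error = exc  # the connection was already closed
```

`get_auctions_eager` corresponds to the first option, calling ToList before the using block ends.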

Resources