SqlAlchemy - when I iterate on a query, do I get a list or a iterator? - performance

I'm starting to learn how to use SQLAlchemy and I'm running into some efficiency problems.
I created an object mapping an existing big table on our Oracle database:
engine = create_engine(connectionString, echo=False)
class POI(object):
def __repr__(self):
return "{poi_id} - {title}, {city} - {uf}".format(**self.__dict__)
def loadSession():
metadata = MetaData(engine)
_poi = Table('tbl_ourpois', metadata, autoload = True)
mapper(POI, _poi)
Session = sessionmaker(bind = engine)
session = Session()
return session
This table have millions of registries. When I do a simple query and try to iterate over it:
session = loadSession()
for poi in session.query(POI):
print poi
I noticed two things: (1) it takes some minutes for it to start printing objects on the screen, (2) memory usage starts to grow like crazy. So, my conclusion was that this code was fetching all the result set in a list and then iterating over it. Is this correct?
With cx_Oracle, when I do a query like:
conn = cx_Oracle.connect(connectionString)
cursor = conn.cursor()
cursor.execute("select * from tbl_ourpois")
for poi in cursor:
print poi
the resulting cursor behaves as an iterator that gets results into a buffer and returns them as they are needed intead of loading the whole thing in a list. This loop starts printing results almost instantly and memory usage is pretty low and constant.
Can I get this kind of behavior wiht SQLAlchemy? Is there a way to get a constant memory iterator out of session.query(POI) instead of a list?

Related

Insert a list of objects into my room database at once, and the order of objects is changed

I tried to insert a list of objects from my legacy litepal database into my room database, and when I retrieved them from my room database, I found the order of my objects is no longer the same as that of my old list.
Below is my code:
// litepal is my legacy database
val litepalNoteList = LitePal.findAll(Note::class.java)
if (litepalNoteList.isNotEmpty()) {
litepalNoteList.forEach { note ->
// before insertion, I want to trun my legacy note objects into Traininglog objects
// note and Traninglog are of different types, but their content should be the same
val noteContent = note.note_content
val htmlContent = note.html_note_content
val createdDate = note.created_date
val isMarked = note.isLevelUp
val legacyLog = TrainingLog(
noteContent = noteContent,
htmlLogContent = htmlContent,
createdDate = createdDate,
isMarked = isMarked)
logViewModel.viewModelScope.launch(Dispatchers.IO) {
trainingLogDao.insertNewTrainingLog(legacyLog)
} // the end of forEach
}
The problem is that in my room database, the order of TraningLog objects differs randomly from that of my old list in the Litepal database.
Anyone konw why is this happening?
If the order matters, then you should extract data using the ORDER BY phrase. Otherwise you are leaving the order up to the query optimiser.
So say instead of #Query("SELECT * FROM trainingLog") then you could ORDER the result by using #Query("SELECT * FROM trainingLog ORDER BY createdDate ASC")
The efficiency of extracting the above would be improved by having an index on the createdDate column/field (in room #ColumnInfo(index = true)). However, it should be noted that there are overheads to having an index. Insertions and deletions and updates may incur additional processing to maintain the index. Additionally an index uses more space.
You may wish to have an insert function that can take a list rather than run multiple threaded inserts. Room will then (I believe) do all the inserts in a single transaction (1 disk write instead of many).
e.g.
instead of or as well as
#Insert
fun insert(trainingLog: TrainingLog): Long
you could have
#Insert
fun insert(trainingLogList: List<TrainingLog>): LongArray
Then all you need to do is build the List in your loop and then after the loop invoke the single insert.

How to get a SOQL query out of a for loop

Code added.
I have searched endlessly for a solution here and cannot find one, please help!
I have three objects (A, B, and C). A has a lookup to B, and B is the master to C (detail). Both A and C have many records related to each B record.
I want to have a job run that gets a subset of records from object C (it will usually be around 5,000 records). Then go through each of those and get the records on Object A that lookup to the same Object B record, summarize an Object A number field, and put that on the C record.
I have successfully gotten this to work in small scale, <100 Object C records. But each Object C record requires a new SOQL query since I am iterating through them in a for loop after I get all the Object C records. Plus I know this it is not best practice to ever have a query in a loop.
How can I get this to work? Since the records share the relationship with Object B, is there another way to get the data from the Object A records that match? Or is there some way to pull two lists, one Object C and one Object A. Then summarize the Object A records and line the lists up some how?
Thanks in advance!
Code:
public class nightlyJob {
public static void updateNumbers(){
integer I = 29;
List<ObjectC__c> CUpdateList = new List<ObjectC__c>();
List<ObjectC__c> CpullList =
[SELECT ID, Index__c, ObjectB__r.id
FROM ObjectC__c
WHERE Index__c = :I];
for(ObjectC__c s : CpullList){
List<ObjectA__c> AList =
[SELECT ObjectB__c, Number__c
FROM ObjectA__c
WHERE ObjectB__c = :s.ObjectB__r.Id];
decimal NumSum = 0;
for(ObjectA__c a : AList){
NumSum = a.Number__c + NumSum;
}
s.Num__c = NumSum;
CUpdateList.add(s);
}
update CUpdateList;
}
}
It looks like you are really missing several fundamental concepts at the moment.
The biggest problem you are up against in SFDC development is that "database" operations are very expensive and are strictly limited. It's not just a matter of "best practice": if in a single transaction you exceed these limits -- number of SOQL calls, number of records returned, number of records updated, number of DML statements, etc. -- your transaction will fail. For details, search online for "Salesforce Execution Governors and Limits".
You can write code that works within these limitations, but there is a bit of a learning curve.
First, learn to use collections with SOQL queries to get your SOQL queries out of loops. This is a.k.a. "bulkfication" and it fundamental to SFDC development:
List<ObjectC__c> CpullList =
[SELECT ID, Index__c, ObjectB__r.id
FROM ObjectC__c
WHERE Index__c = :I];
// Create a map with the results of this query.
// key=ObjectC__c.Id, value = Object__c record
Map<Id, ObjectC__c> objCmap = Map<Id, ObjectC__c>(CpullList);
// Build a set of all the Object_B id's from this result set
Set<Id> objBids = new Set<Id>();
for (ObjectC__c record : CpullList) {
objBids.add(record.ObjectB__r.id);
}
// Now you can use only one SOQL query instead of a loop
List<ObjectA> AList = [SELECT ObjectB__c, Number__c
FROM ObjectA__c
WHERE ObjectB__c in:objBids];
Next, use "SOQL aggregate functions" whenever you can. Example: in your code here, you could use "SUM()" and "group by" instead of performing these calculations with loops:
// Get the sum of ObjectA__c.Number__c for each Object B in objBIds
AggregateResult[] groupedResults = [select ObjectB__c,
sum(Number__c) sumA
from ObjectA__c
where ObjectB__c in: objBids
group by ObjectB__c];
for (AggregateResult ar : groupedResults) {
System.debug('Object B Id' + ar.get('Objectb__c'));
System.debug('Sum of ObjectA__c.Number__c' + ar.get('sumA'));
// Here, you might want to build a Map<Id, Integer> sumAmap:
// key=Object B ID, value=sumA
// and then use it along with objCmap to build a collection of Object C's
// for your update statement...
}
You can continue this process and apply these ideas to make the code more efficient.
But even after you have your methods working as efficiently as possible, you still may run into limits due to the number of records you're dealing with. At that point, you will need to learn about the Batchable interface, the Queuable interface and #future calls (how to process a larger number of records, split across transactions) That's really too much to information to cover in a single SO answer.

Select Count very slow using EF with Oracle

I'm using EF 5 with Oracle database.
I'm doing a select count in a table with a specific parameter. When I'm using EF, the query returns the value 31, as expected, But the result takes about 10 seconds to be returned.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
var teste = (from ens in serv.PA_ENSAIOS_UM
where ens.COD_IDENT_UNMET == "FBLDY3840"
select ens).Count();
}
If I execute the simple query bellow the result is the same (31), but the result is showed in 500 milisecond.
SELECT
count(*)
FROM
PA_ENSAIOS_UM
WHERE
COD_IDENT_UNMET 'FBLDY3840'
There are a way to improve the performance when I'm using EF?
Note: There are 13.000.000 lines in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you will instantiate your DbContext (let's call it "_context") and then set the Database.Log property to be the logging method. It's fine if this method doesn't actually do anything--you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses nLog to write to files)
public static void LogQuery(string queryData)
{
if (string.IsNullOrWhiteSpace(queryData))
return;
var message = string.Format("{0}{1}",
queryData.Trim().Contains(Environment.NewLine) ?
Environment.NewLine : "", queryData);
_sqlLogger.Info(message);
_genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then when you create your context point to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that often the first run is the slowest because the server has to actually do the work, but on the subsequent runs, it often uses cached data. Try running your tests 2-3 times back to back and see if they don't start to run in the same time.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may provide better time)
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, esp. since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL or not.

Grails hibernate session in batches

GORM works fine out of the box as long as there is no batch with more than 10.000 objects. Without optimisation you will face the outOfMemory problems.
The common solution is to flush() and clear() the session each n (e.g.n=500) objects:
Session session = sessionFactory.currentSession
Transaction tx = session.beginTransaction();
def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP
Date yesterday = new Date() - 1
Criteria c = session.createCriteria(Foo.class)
c.add(Restrictions.lt('lastUpdated',yesterday))
ScrollableResults rawObjects = c.scroll(ScrollMode.FORWARD_ONLY)
int count=0;
while ( rawObjects.next() ) {
def rawOject = rawObjects.get(0);
fooService.doSomething()
int batchSize = 500
if ( ++count % batchSize == 0 ) {
//flush a batch of updates and release memory:
try{
session.flush();
}catch(Exception e){
log.error(session)
log.error(" error: " + e.message)
throw e
}
session.clear();
propertyInstanceMap.get().clear()
}
}
session.flush()
session.clear()
tx.commit()
But there are some problems I can't solve:
If I use currentSession, then the controller fails because of session is empty
If I use sessionFactory.openSession(), then the currentSession is still used inside FooService. Of cause I can use the session.save(object) notation. But this means, that I have to modify fooService.doSomething() and duplicate code for single operation (common grails notation like fooObject.save() ) and batch operation (session.save(fooObject() ).. notation).
If I use Foo.withSession{session->} or Foo.withNewSession{session->}, then the objects of Foo Class are cleared by session.clear() as expected. All the other objects are not cleared(), what leads to memory leak.
Of cause I can use evict(object) to manualy clear the session. But it is nearly impossible to get all relevant objects, due to autofetching of assosiations.
So I have no idea how to solve my problems without making the FooService.doSomething() more complex. I'm looking for something like withSession{} for all domains. Or to save session at the begin (Session tmp = currentSession) and do something like sessionFactory.setCurrentSession(tmp). Both doesn't exists!
Any idea is wellcome!
I would recommend to use stateless session for this kind of batch processing. See this post: Using StatelessSession for Batch processing
A modified approach to what you are doing would be:
Loop over your entire collection (rawObjects) and save a list of all the ids for those objects.
Loop over the list of ids. At each iteration, look up just that single object, by its id.
Then use the same periodic clearing of the session cache like you are doing now.
By the way, someone else has suggested an approach similar to yours. But note that the code in this link is incorrect; the lines that clear the session should be inside the if statement, just like you have in your solution.

NHibernate IQueryable doesn't seem to delay execution

I'm using NHibernate 3.2 and I have a repository method that looks like:
public IEnumerable<MyModel> GetActiveMyModel()
{
return from m in Session.Query<MyModel>()
where m.Active == true
select m;
}
Which works as expected. However, sometimes when I use this method I want to filter it further:
var models = MyRepository.GetActiveMyModel();
var filtered = from m in models
where m.ID < 100
select new { m.Name };
Which produces the same SQL as the first one and the second filter and select must be done after the fact. I thought the whole point in LINQ is that it formed an expression tree that was unravelled when it's needed and therefore the correct SQL for the job could be created, saving my database requests.
If not, it means all of my repository methods have to return exactly what is needed and I can't make use of LINQ further down the chain without taking a penalty.
Have I got this wrong?
Updated
In response to the comment below: I omitted the line where I iterate over the results, which causes the initial SQL to be run (WHERE Active = 1) and the second filter (ID < 100) is obviously done in .NET.
Also, If I replace the second chunk of code with
var models = MyRepository.GetActiveMyModel();
var filtered = from m in models
where m.Items.Count > 0
select new { m.Name };
It generates the initial SQL to retrieve the active records and then runs a separate SQL statement for each record to find out how many Items it has, rather than writing something like I'd expect:
SELECT Name
FROM MyModel m
WHERE Active = 1
AND (SELECT COUNT(*) FROM Items WHERE MyModelID = m.ID) > 0
You are returning IEnumerable<MyModel> from the method, which will cause in-memory evaluation from that point on, even if the underlying sequence is IQueryable<MyModel>.
If you want to allow code after GetActiveMyModel to add to the SQL query, return IQueryable<MyModel> instead.
You're running IEnumerable's extension method "Where" instead of IQueryable's. It will still evaluate lazily and give the same output, however it evaluates the IQueryable on entry and you're filtering the collection in memory instead of against the database.
When you later add an extra condition on another table (the count), it has to lazily fetch each and every one of the Items collections from the database since it has already evaluated the IQueryable before it knew about the condition.
(Yes, I would also like to be the extensive extension methods on IEnumerable to instead be virtual members, but, alas, they're not)

Resources