ActiveRecord in batches? after_commit produces O(n) trouble - ruby

I'm looking for a good idiomatic rails pattern or gem to handle the problem of inefficient after_commit model callbacks. We want to stick with a callback to guarantee data integrity but we'd like it to run once whether it's for one record or for a whole batch of records wrapped in a transaction.
Here's a use-case:
A Portfolio has many positions.
On Position there is an after_commit hook to re-calculate numbers in reference to its sibling positions in the portfolio.
That's fine for directly editing one position.
However...
We now have an importer that brings in lots of positions spanning many portfolios in one big INSERT. Each invocation of this callback queries all siblings, and it's invoked once for each sibling - so reads are O(n**2) instead of O(n), and writes are O(n) where they should be O(1).
'Why not just put the callback on the parent portfolio?' Because the parent doesn't necessarily get touched during a relevant update. We can't risk the kind of inconsistent state that could result from leaving a gap like that.
Is there anything out there which can leverage the fact that we're committing all the records at once in a transaction? In principle it shouldn't be too hard to figure out which records changed.
A nice interface would be something like after_batch_commit which might provide a light object with all the changed data or at least the ids of affected rows.
There are lots of unrelated parts of our app that are asking for a solution like this.
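To make the idea concrete, here is a rough sketch of the kind of thing we're imagining (batch_import, PortfolioRecalculator and recalculate_for are made-up names, not an existing API): suppress the per-record work while a batch is in flight, collect the affected portfolio ids, and recalculate once per portfolio after the surrounding transaction commits.

class Position < ApplicationRecord
  belongs_to :portfolio

  thread_mattr_accessor :batch_portfolio_ids   # nil outside of a batch import

  after_commit :recalculate_siblings

  # Wrap a bulk import in this to defer the per-record recalculation.
  def self.batch_import
    self.batch_portfolio_ids = Set.new
    transaction { yield }
    batch_portfolio_ids.each { |id| PortfolioRecalculator.recalculate_for(id) }
  ensure
    self.batch_portfolio_ids = nil
  end

  private

  def recalculate_siblings
    # PortfolioRecalculator is a stand-in for whatever does the sibling maths today.
    if self.class.batch_portfolio_ids
      self.class.batch_portfolio_ids << portfolio_id   # defer: at most once per portfolio
    else
      PortfolioRecalculator.recalculate_for(portfolio_id)   # existing single-record path
    end
  end
end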

One solution could be inserting them all in one SQL statement then validating them afterwards.
Possible ways of inserting them in a single statement are suggested in this post:
INSERT multiple records using ruby on rails active record
Or you could even build the sql to insert all the records in one trip to the database.
The code could look something like this:
max_id = Position.maximum(:id) || 0
Position.insert_all(data) # e.g. Rails 6+'s insert_all (skips validations and callbacks), or a bulk-insert gem
faulty_positions = Position.where("id > ?", max_id).reject(&:valid?)
remove_and_or_log_faulty_positions(faulty_positions)
This way you only have to touch the database three times per N entries in your data. If the data sets are large it might be good to do it in batches, as you mention.
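For the batched variant, a minimal sketch (assuming data is an array of attribute hashes; the batch size of 1,000 is arbitrary):

data.each_slice(1000) do |slice|
  Position.insert_all(slice)   # one multi-row INSERT per slice
end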

Related

Android room insert/update result

I've been using Room for a while. I'm from a MySQL background where you have to check the results of queries and so on. In Room, I find this a bit complicated because so far I can either declare the DAO insert method as void or as a long returning the rowId.
If I return a long, I have to write a listener to notify the UI of success/failure
My question is, is this necessary? Do I need the return value of inserts/updates/deletes or are these queries guaranteed to succeed?
My question is, is this necessary?
This depends, and is hopefully better explained (at least a little, anyway) below.
Do I need the return value of inserts/updates/deletes or are these queries guaranteed to succeed?
There is no guarantee that they will succeed. However, you may be able to assume they have or use CONFLICT handling.
Much could depend upon how the Entities are coded. For example, say you had a simple table (Entity) with an id and a name, and for simplicity that you have autoGenerate = true and you never allow the id to be specified when inserting. Unless the database is massive (beyond storage device capacity) or has been tweaked, a unique id will always result.
If the name needs to be UNIQUE then you are introducing a facet that makes it more likely that the insert will not succeed. If you had the onConflict strategy of the insert set to IGNORE, then a duplicate wouldn't fail, but you may want to know if nothing was inserted (-1 returned).
This is just one facet. The answer is really that you need to consider the design of the database and of the app itself. Personally I'd always go with informing the user at least of the abnormal/unexpected, which probably then means yes, it is necessary (typically it is easier to suppress code than to add new code).

insert data from one table to two tables group by for Oracle

I have a situation where a large amount of data (9+ billion rows per day) is collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that needs to be split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1332036546
1,1432039646
I can create the schemas and do a select like
select request, type, response, sum(hits) as hitcount from loader group by request, type, response;
to get data into the final table. For best performance I want to see if I can use "insert all" to move data from the loader to these two tables, and perhaps use triggers in the database to achieve this. Any ideas and recommendations on the best ways to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
That said, this raises some business issues. Do the users only want to see the whole picture, after the Close-Of-Day? Or would they derive benefit from seeing intra-day results? If yes, how do you distinguish intra-day from Close-Of-Day results? If no, how do you hide partially processed results whilst the rest is still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figures mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras: obviously Enterprise Edition for parallel processing, but also the Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight away. This is a consultancy gig, not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into FINAL and TIMESTAMP tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMP and FINAL independently. (Collisions are extremely unlikely.)
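As a rough illustration of that idea (in Ruby purely for brevity; inside the database itself something like ORA_HASH or STANDARD_HASH would play the same role), the same tuple always hashes to the same key, so FINAL and TIMESTAMPS can be loaded independently without coordinating on a sequence:

require "digest"

# Hypothetical helper: derive a deterministic surrogate key from the natural key.
def tuple_key(request, type, response)
  Digest::SHA1.hexdigest("#{request}|#{type}|#{response}")
end

tuple_key("mydomain.com", "A", "203.11.12.1")   # same input, same key, on every load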
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.

Rails 3 - ActiveRecord, what is more efficient (update vs. count)?

Okay, let's say I have 2 different models:
Poll (has_many :votes)
Vote (belongs_to :poll)
So, one poll can have many votes.
At the moment I'm displaying a list of all polls including its overall vote_count.
Every time someone votes for a poll, I'm going to update the overall vote_count of the specific poll by using:
@poll = Poll.update(params[:poll_id], :vote_count => vote_count+1)
To retrieve the vote_count I use @poll.vote_count, which works fine.
Lets assume I got a huge amount of polls (in my db) and a lot of people would vote for the same poll at the same time.
Question: Wouldn't it be more efficient to remove the vote_count from the poll table and use @vote_count = Poll.find(params[:poll_id]).votes.count for retrieving the overall vote_count instead? Which operation (update vs. count) would make more sense in this case? (I'm using postgresql in production)
Thanks for the help!
Have you considered using a counter cache (see the counter_cache option)? Rails has this built-in functionality to handle all the possible updates to an association and how they affect the counter.
It's as simple as adding a 0-initialized integer column named #{attribute.pluralize}_count (in your case votes_count) on the table on the "one" side of the association (in your case polls).
And then on the other side of the association add the :counter_cache => true argument to the belongs_to statement.
belongs_to :poll, :counter_cache => true
Now this doesn't answer your question exactly, and the correct answer will depend on the shape of your data and the indexes you've configured. If you're expecting your votes table to number in the millions spread out over thousands of polls, then go with the counter cache, otherwise just counting the associations should be fine.
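For completeness, a minimal setup sketch (the migration class name is made up; Poll.reset_counters is Rails' built-in way to backfill the column for existing rows):

# Migration: add the cache column, defaulted to 0 (Rails 3-era syntax to match the question).
class AddVotesCountToPolls < ActiveRecord::Migration
  def change
    add_column :polls, :votes_count, :integer, :default => 0, :null => false
  end
end

# app/models/vote.rb
class Vote < ActiveRecord::Base
  belongs_to :poll, :counter_cache => true
end

# Backfill the counter for existing polls once (e.g. from the console):
Poll.find_each { |poll| Poll.reset_counters(poll.id, :votes) }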
This is a great question as it touches the fundamental issue of storing summary and aggregate information.
Generally this is not a good idea as things can easily get out of sync as systems grow.
Sometimes there are occasions when you do want summary information, but these are more specialized cases, such as read-only databases that are used solely for reporting and are updated once a day at midnight.
In those cases storing summary/aggregate data is not only OK but is preferred over recalculating the same summary/aggregate information with every query. It also depends on both usage and size: e.g. if there are 300 queries a day (against the once-a-day-updated, read-only database), they all have to calculate the same totals, and each query reads 20,000 rows, it is more efficient to do that calculation once and store it. As the data and queries grow, this may be the only practical way to allow complex reporting.
To me, it doesn't make sense in such a simple case to keep a vote_count on Poll. Counting rows is really fast, and if you add a vote and forget to increment vote_count then the data is kind of broken...

Can I substitute savepoints for starting new transactions in Oracle?

Right now the process that we're using for inserting sets of records is something like this:
(and note that "set of records" means something like a person's record along with their addresses, phone numbers, or any other joined tables).
Start a transaction.
Insert a set of records that are related.
Commit if everything was successful, roll back otherwise.
Go back to step 1 for the next set of records.
Should we be doing something more like this?
Start a transaction at the beginning of the script
Start a save point for each set of records.
Insert a set of related records.
Roll back to the savepoint if there is an error, go on if everything is successful.
Commit the transaction at the end of the script.
After having some issues with ORA-01555 and reading a few Ask Tom articles (like this one), I'm thinking about trying out the second process. Of course, as Tom points out, starting a new transaction is something that should be defined by business needs. Is the second process worth trying out, or is it a bad idea?
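For illustration only, here is roughly what that second shape looks like in ActiveRecord terms (the ORM from the question at the top of this page); transaction(requires_new: true) is implemented with savepoints, so an error in one set rolls back just that set while the outer transaction carries on (record_sets and log_failed_set are made-up names):

# One outer transaction, one savepoint per set of related records.
ActiveRecord::Base.transaction do
  record_sets.each do |set|
    begin
      ActiveRecord::Base.transaction(requires_new: true) do   # issues a SAVEPOINT
        person = Person.create!(set[:person])
        person.addresses.create!(set[:addresses])
        person.phone_numbers.create!(set[:phone_numbers])
      end
    rescue ActiveRecord::ActiveRecordError => e
      log_failed_set(set, e)   # the savepoint has already been rolled back
    end
  end
end   # single COMMIT at the end of the script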
A transaction should be a meaningful Unit Of Work. But what constitutes a Unit Of Work depends upon context. In an OLTP system a Unit Of Work would be a single Person, along with their address information, etc. But it sounds as if you are implementing some form of batch processing, which is loading lots of Persons.
If you are having problems with ORA-01555 it is almost certainly because you have a long-running query supplying data which is being updated by other transactions. Committing inside your loop contributes to the cyclical use of UNDO segments, and so will tend to increase the likelihood that the segments you are relying on to provide read consistency will have been reused. So, not doing that is probably a good idea.
Whether using SAVEPOINTs is the solution is a different matter. I'm not sure what advantage that would give you in your situation. As you are working with Oracle10g perhaps you should consider using bulk DML error logging instead.
Alternatively you might wish to rewrite the driving query so that it works with smaller chunks of data. Without knowing more about the specifics of your process I can't give specific advice. But in general, instead of opening one cursor for 10000 records it might be better to open it twenty times for 500 rows a pop. The other thing to consider is whether the insertion process can be made more efficient, say by using bulk collection and FORALL.
Some thoughts...
Seems to me one of the points of the asktom link was to size your rollback/undo appropriately to avoid the 1555's. Is there some reason this is not possible? As he points out, it's far cheaper to buy disk than it is to write/maintain code to handle getting around rollback limitations (although I had to do a double-take after reading the $250 pricetag for a 36Gb drive - that thread started in 2002! Good illustration of Moore's Law!)
This link (Burleson) shows one possible issue with savepoints.
Is your transaction in actuality steps 2, 3, and 5 in your second scenario? If so, that's what I'd do - commit each transaction. It sounds to me like scenario 2 is really a collection of transactions rolled into one.

Thread-safe unique entity instance in Core Data

I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fails the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch queries before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I had by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation or run it in the loop that is processing the data. Personally I would do it above the creation of the NSManagedObject so that you do not have the unnecessary allocs when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (type NSManagedObjectID) of all managed objects is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this guarantee basically guarantees that the objectIDs are globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary but still unique ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save the NSManagedObjectContext, you'd like to make sure the messageID is unique - that is, that you are not updating an existing row and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be to do it on an application-level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, ill fix it later for you"-strategy.
This is not ideal, of course, but a valid way to attack this problen, imho.
