Rails 3 - ActiveRecord, what is more efficient (update vs. count)?

Okay, let's say I have two different models:
Poll (has_many :votes)
Vote (belongs_to :poll)
So, one poll can have many votes.
At the moment I'm displaying a list of all polls, including their overall vote_count.
Every time someone votes for a poll, I update the overall vote_count of that specific poll using:
@poll = Poll.update(params[:poll_id], :vote_count => vote_count + 1)
To retrieve the vote_count I use @poll.vote_count, which works fine.
Let's assume I have a huge number of polls (in my db) and a lot of people voting for the same poll at the same time.
Question: Wouldn't it be more efficient to remove the vote_count from the poll table and use @vote_count = Poll.find(params[:poll_id]).votes.count to retrieve the overall vote_count instead? Which operation (update vs. count) would make more sense in this case? (I'm using PostgreSQL in production.)
Thanks for the help!

Have you considered using a counter cache (see the :counter_cache option)? Rails has this functionality built in to handle all the possible updates to an association and how they affect the counter.
It's as simple as adding an integer column, initialized to 0, named #{attribute.pluralize}_count (in your case votes_count) on the table of the "one" side of the association (in your case Poll).
Then, on the other side of the association, add the :counter_cache => true argument to the belongs_to statement:
belongs_to :poll, :counter_cache => true
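For reference, here is a minimal sketch of the column side; the migration and the reset_counters backfill follow standard Rails conventions, with names adjusted to this question's models:

class AddVotesCountToPolls < ActiveRecord::Migration
  def change
    # Rails convention: <association>_count, defaulting to 0
    add_column :polls, :votes_count, :integer, :default => 0, :null => false
  end
end

# Backfill existing polls once (e.g. from the console or a one-off task):
Poll.find_each { |poll| Poll.reset_counters(poll.id, :votes) }

With that in place, @poll.votes_count stays in sync as votes are created and destroyed through the association.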
Now this doesn't answer your question exactly, and the correct answer will depend on the shape of your data and the indexes you've configured. If you're expecting your votes table to number in the millions spread out over thousands of polls, then go with the counter cache, otherwise just counting the associations should be fine.

This is a great question as it touches the fundamental issue of storing summary and aggregate information.
Generally this is not a good idea as things can easily get out of sync as systems grow.
Sometimes there are occasions when you do want summary information, but these are more specialized cases, such as read-only databases that are used solely for reporting and are updated once a day at midnight.
In those cases summary/aggregate reporting is not only OK but preferred over recalculating the same summaries with each query. Whether it pays off also depends on usage and size: if there are 300 queries a day (against the once-a-day-updated, read-only database) that all calculate the same totals, and each query reads 20,000 rows, it is more efficient to do that calculation once and store the result. As the data and queries grow, this may be the only practical way to allow complex reporting.
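As a hedged illustration of that pattern in Rails terms (PollDailySummary and its columns are hypothetical, not something from the question):

# Nightly rollup: compute each poll's total once and store it,
# instead of re-counting it on every reporting query.
Poll.find_each do |poll|
  summary = PollDailySummary.where(:poll_id => poll.id, :summary_date => Date.today).first ||
            PollDailySummary.new(:poll_id => poll.id, :summary_date => Date.today)
  summary.vote_count = poll.votes.count
  summary.save!
end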

To me, it doesn't make sense in such a simple case to keep a vote_count on Poll. Counting rows is really fast, and if you add a vote and forget to increment vote_count, the data ends up inconsistent...

Related

ActiveRecord in batches? after_commit produces O(n) trouble

I'm looking for a good idiomatic rails pattern or gem to handle the problem of inefficient after_commit model callbacks. We want to stick with a callback to guarantee data integrity but we'd like it to run once whether it's for one record or for a whole batch of records wrapped in a transaction.
Here's a use-case:
A Portfolio has many positions.
On Position there is an after_commit hook to re-calculate numbers in reference to its sibling positions in the portfolio.
That's fine for directly editing one position.
However...
We now have an importer for bringing in lots of positions spanning many portfolios in one big INSERT. Each invocation of this callback queries all the siblings, and it's invoked once for each sibling, so reads are O(n**2) instead of O(n) and writes are O(n) where they should be O(1).
'Why not just put the callback on the parent portfolio?' Because the parent doesn't necessarily get touched during a relevant update. We can't risk the kind of inconsistent state that could result from leaving a gap like that.
Is there anything out there which can leverage the fact that we're committing all the records at once in a transaction? In principle it shouldn't be too hard to figure out which records changed.
A nice interface would be something like after_batch_commit which might provide a light object with all the changed data or at least the ids of affected rows.
There are lots of unrelated parts of our app that are asking for a solution like this.
One solution could be inserting them all in one SQL statement and then validating them afterwards.
Possible ways of inserting them in a single statement are suggested in this post:
INSERT multiple records using ruby on rails active record
Or you could even build the SQL to insert all the records in one trip to the database.
The code could look something like this:
max_id = Position.maximum(:id)              # remember the highest existing id
Position.insert_many(data)                  # not actual code - stand-in for a single bulk INSERT
faulty_positions = Position.where("id > ?", max_id).reject(&:valid?)  # validate only the new rows
remove_and_or_log_faulty_positions(faulty_positions)
This way you only have to touch the database three times per N entries in your data. For large data sets it might be good to do it in batches, as you mention.
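Whichever bulk-insert mechanism you use (on Rails 6+ the insert itself could be done with Position.insert_all(data)), the sibling recalculation can then run once per affected portfolio instead of once per inserted position. A rough sketch, where Portfolio#recalculate! is a hypothetical stand-in for the logic currently living in the after_commit callback:

# One recalculation per touched portfolio, not one per position.
portfolio_ids = data.map { |attrs| attrs[:portfolio_id] }.uniq
Portfolio.where(id: portfolio_ids).find_each(&:recalculate!)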

insert data from one table to two tables group by for Oracle

I have a situation where a large amount of data (9+ billion rows per day) is being collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that needs to be split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1432036546
1,1432039646
I can create the schemas and do the select like
select request,type,response,sum(hitcount) from loader group by request,type,response;
to get data into the final table. For best performance I want to see if I can use "insert all" to move data from the loader into these two tables, perhaps using triggers in the database. Any ideas and recommendations on the best ways to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
That said, this raises some business questions. Do the users only want to see the whole picture after the Close-Of-Day? Or would they derive benefit from seeing intra-day results? If yes, how do you distinguish intra-day from Close-Of-Day results? If no, how do you hide partially processed results whilst the rest are still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figures mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras: obviously Enterprise Edition for parallel processing, but also the Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight-away. This is a consultancy gig not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into FINAL and TIMESTAMP tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMP and FINAL independently. (Collisions are extremely unlikely.)
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.

How to speed up performance by avoiding to query Mongoid multiple times?

I have approx. 10 million Article objects in a Mongoid database. The huge number of Article objects makes the queries quite time consuming to perform.
As exemplified below, I am registering, for each week (e.g. 700 days ago .. 7 days ago, 0 days ago), how many articles are in the database.
But for every query I make, the time consumption is increased, and Mongoid's CPU usage quickly reaches +100%.
articles = Article.where(published: true).asc(:datetime)
days = Date.today.mjd - articles.first.datetime.to_date.mjd   # days since the oldest article
days.step(0, -7) do |n|
  current_date = Date.today - n.days
  previous_articles = articles.lt(datetime: current_date)              # articles before this week's cutoff
  previous_good_articles = previous_articles.where(good: true).size    # hits the database
  previous_bad_articles = previous_articles.where(good: false).size    # hits the database again
end
Is there a way to keep the Article objects in memory, so that I only need to hit the database on the first line?
A MongoDB database is not built for that.
I think the best way is to run a daily script that creates your data for that day and saves it in a Redis database: http://www.redis.io
Redis stores your data in the server's memory, so you can access it at any time of day.
And it is very quick.
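A small sketch of that daily-script idea with the redis gem (the key names here are made up for illustration):

require 'redis'

redis = Redis.new
today = Date.today.to_s
redis.set("articles:good:#{today}", Article.where(published: true, good: true).count)
redis.set("articles:bad:#{today}",  Article.where(published: true, good: false).count)

# Reading the precomputed numbers back is a constant-time lookup:
good_today = redis.get("articles:good:#{today}").to_i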
Don't Repeat Yourself (DRY) is a best practice that applies not only to code but also to processing. Many applications have natural epochs for summarizing data; a day is a good choice in your case, and since the data is historical, it only has to be summarized once. So you reduce the processing of 10 million Article documents down to roughly 700 day-summary documents. You need special code for merging in today's data if you want up-to-the-moment accuracy, but the savings are well worth the effort.
I politely disagree with the statement, "A MongoDB database is not built for that." You can see from the above that it is all about not repeating processing. The 700 day-summary documents can be stored in any reasonable data store. Since you are already using MongoDB, simply use another MongoDB collection for the day summaries. There's no need to spin up another data store if you don't want to. The summary data will easily fit in memory, and the reduction in processing means that your working set will no longer be blown out by the historical processing.
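If you stay inside MongoDB, the day-summary collection could look roughly like this (DaySummary and the nightly job are illustrative assumptions, not code from the question):

class DaySummary
  include Mongoid::Document
  field :day, type: Date
  field :good_count, type: Integer
  field :bad_count, type: Integer
end

# Nightly job: store cumulative counts of articles published before the given day.
day = Date.today
summary = DaySummary.find_or_initialize_by(day: day)
summary.good_count = Article.where(published: true, good: true).lt(datetime: day).count
summary.bad_count  = Article.where(published: true, good: false).lt(datetime: day).count
summary.save!

The 700-point history then comes from reading DaySummary rather than re-scanning 10 million Article documents on every request.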

Joomla getItems default Pagination

Can anyone tell me if the getItems() function in the model automatically adds the globally set LIMIT before it executes the query (from getListQuery())? Joomla is really struggling, seemingly trying to cache the entire result set (over 1 million records here!).
After looking in /libraries/legacy/model/list.php and /libraries/legacy/model/legacy.php, it appears that getItems() does add LIMIT to setQuery using $this->getState('list.limit') before it sends the results to the cache. But if this is the case, why is Joomla struggling so much?
So what's going on? How come phpMyAdmin can return the limited results within a second and Joomla just times out?
Many thanks!
If you have one million records, you'll most definitely want to do as Riccardo is suggesting, override and optimize the model.
JModelList runs the query twice, once for the pagination numbers and then for the display query itself. You'll want to carefully inherit from JModelList to avoid the pagination query.
Also, the articles query is notorious for its joins. You can definitely lose some of that slowdown (I doubt you are using the contacts link, for example).
If all articles are visible to public, you can remove the ACL check - that's pretty costly.
There is no DBA from the West or the East who is able to explain why all of those GROUP BY's are needed, either.
Losing those things will help considerably. In fact, building your query from scratch might be best.
It does add the pagination automatically.
Its struggling is most likely due to a large dataset (i.e. 1000+ items returned in the collection) and many lookup fields: the content modules, for example, join as many as 10 tables to get author names etc.
This can be a real killer: I had queries running for over one second on a dedicated server with only 3000 content items. One tag-cloud component we found could take as long as 45 seconds to return a keyword list. If this is the situation (a lot of records and many joins), your only way out is to further limit the filters in the options to see if you can get some faster results (for example, limiting to articles in the last 3 months can reduce the time needed dramatically).
But if this is not sufficient or not viable, you're left with writing a new optimized query in a new model, which will ultimately bring a bigger performance gain than any other optimization. In writing the query, consider leveraging database-specific optimizations, i.e. adding indexes and full-text indexes, and only use joins if you really need them.
Also make sure the number of joins does not grow with the number of fields, translations or anything else.
A constant query is easy for the db engine to optimize and cache, whilst a dynamic query will never be as efficient.

Cost of time-stamping as a method of concurrency control with Entity Framework

With optimistic concurrency, the usual way to control concurrency is a timestamp field. However, in my particular case, not all of the fields need to be controlled with respect to concurrency.
For example, I have a products table holding the amount of stock. This table has fields like description, code... etc. For me, it is not a problem if one user modifies these fields, but I do have to control whether some other user changes the stock.
So if I use a timestamp and one user changes the description and another changes the amount of stock, the second user will get an exception.
However, if I use the stock field for the concurrency check instead, then the first user can update that information and the second can update the stock without problems.
Is it a good solution to use the stock field to control concurrency, or is it better to always use a timestamp field?
And if in the future I need to add a new important field, do I then need to use two fields to control concurrency (stock and the new one)? Does that have a high cost in terms of performance?
Consider the definition of optimistic concurrency:
In the field of relational database management systems, optimistic concurrency control (OCC) is a concurrency control method that assumes that multiple transactions can complete without affecting each other, and that therefore transactions can proceed without locking the data resources that they affect. (Wikipedia)
Clearly this definition is abstract and leaves a lot of room for your specific implementation.
Let me give you an example. A few years back I evaluated the same thing with a bunch of colleagues and we realized that in our application, on some of the tables, it was okay for the concurrency to simply be based on the fields the user was updating.
So, in other words, as long as the fields they were updating hadn't changed since they fetched the row, we'd let them update the row, because the rest of the fields really didn't matter and the row was going to get refreshed on update anyway, so they would get the most recent changes by other users.
So, in short, I would say what you're doing is just fine and there aren't really any hard and fast rules. It really depends on what you need. If you need it to be more flexible, like what you're talking about, then make it more flexible -- simple.
