Are there any Ruby ORMs which use cursors or smart fetch?

I'm looking for a Ruby ORM to replace ActiveRecord. I've been looking at Sequel and DataMapper. They look pretty good, but neither of them seems to do the basics: not loading everything into memory when you don't need it.
I mean, I've tried the following (or equivalent) with ActiveRecord and Sequel on a table with lots of rows:
posts.each { |p| puts p }
Both of them go crazy on memory. They seem to load everything into memory rather than fetching rows as needed. I used find_in_batches in ActiveRecord, but it's not an acceptable solution:
ActiveRecord is not an acceptable solution because we had too many problems with it.
Why should my code be aware of a paging mechanism? I'm happy to configure the page size somewhere, but that's it. With find_in_batches you need to do something like:
post.find_in_batches { |batch| batch.each { |p| puts p } }
But that should be transparent.
So is there somewhere a reliable Ruby ORM which does the fetch properly?
Update:
As Sergio mentioned, in Rails 3 you can use find_each, which does exactly what I want. However, as ActiveRecord is not an option (unless someone can really convince me to use it), the questions are:
Which ORMs support the equivalent of find_each?
How to do it?
Why do we need a find_each at all? Shouldn't find do this by itself?

Sequel's Dataset#each does yield rows one at a time, but most database drivers will load the entire result into memory first.
If you are using Sequel's Postgres adapter, you can choose to use real cursors:
posts.use_cursor.each{|p| puts p}
This fetches 1000 rows at a time by default, but you can use an option to specify the number of rows to grab per cursor fetch:
posts.use_cursor(:rows_per_fetch=>100).each{|p| puts p}
If you aren't using Sequel's Postgres adapter, you can use Sequel's pagination extension:
Sequel.extension :pagination
posts.order(:id).each_page(1000){|ds| ds.each{|p| puts p}}
However, like ActiveRecord's find_in_batches/find_each, this does separate queries, so you need to be careful if there are concurrent modifications to the dataset you are retrieving.
The reason this isn't the default in Sequel is probably the same reason it isn't the default in ActiveRecord, which is that it isn't a good default in the general case. Only queries with large result sets really need to worry about it, and most queries don't return large result sets.
At least with the Postgres adapter cursor support, it's fairly easy to make it the default for your model:
Post.dataset = Post.dataset.use_cursor
For the pagination extension, you can't really do that, but you can wrap it in a method that makes it mostly transparent.
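For example, something along these lines (the each_row method name, the Post model, and the 1000 page size are just placeholders, not part of Sequel's API):
Sequel.extension :pagination

class Post < Sequel::Model
  # Iterate over every row one page at a time, so callers never have to
  # know that pagination is happening underneath.
  def self.each_row(page_size = 1000, &block)
    dataset.order(:id).each_page(page_size) { |page| page.each(&block) }
  end
end

Post.each_row { |p| puts p }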

Sequel.extension :pagination
posts.order(:id).each_page(1000) do |ds|
  ds.each { |p| puts p }
end
It is very, very slow on large tables!
This becomes clear when you look at the method body: each page is fetched with a plain LIMIT/OFFSET, so the offset grows with every page and the database has to scan past more and more rows to reach each successive page.
http://sequel.rubyforge.org/rdoc-plugins/classes/Sequel/Dataset.html#method-i-paginate
# File lib/sequel/extensions/pagination.rb, line 11
def paginate(page_no, page_size, record_count=nil)
  raise(Error, "You cannot paginate a dataset that already has a limit") if @opts[:limit]
  paginated = limit(page_size, (page_no - 1) * page_size)
  paginated.extend(Pagination)
  paginated.set_pagination_info(page_no, page_size, record_count || count)
end

ActiveRecord actually has an almost transparent batch mode:
User.find_each do |user|
  NewsLetter.weekly_deliver(user)
end
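If the default batch size of 1000 isn't what you want, find_each also accepts a :batch_size option; the value of 500 below is only an example:
User.find_each(:batch_size => 500) do |user|
  NewsLetter.weekly_deliver(user)
end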

This code works faster than find_in_batches in ActiveRecord:
id_max = table.max(:id)
id_min = table.min(:id)
n = 1000
(0..(id_max - id_min) / n).each do |i|
  table.where { (id >= id_min + n * i) & (id < id_min + n * (i + 1)) }.each { |row| }
end

Maybe you can consider Ohm, which is based on the Redis NoSQL store.

Related

Is there a DB independent way to get Db2 SQL results in Ruby?

I have Rake code which I've cobbled together from some sample code in my shop, and the advice of another programmer. It looks more or less as follows:
class Ticket666StupidDb2Test < ActiveRecord::Migration
  def up
    lookup('WXYZ3529300')
  end

  def down
    puts "down, boy!"
  end

  def lookup(xiskid)
    qresult = exec_query("SELECT DISTINCT SNARK, GLOOPEE FROM VITA_XISK WHERE XISKID = '#{xiskid}'")
    while row = IBM_DB.fetch_array(qresult) do
      snark = row[0]
      gloopee = row[1]
      puts "snark = #{snark}; gloopee = #{gloopee}"
    end
  end
end
Is there a way to use the result set's column names to get at the data, instead of integer indices? I looked at http://rubyibm.rubyforge.org/docs/adapter/2.5.9/doc/ActiveRecord/ConnectionAdapters/IBM_DBAdapter.html and found at least one method, fetch_data, which looked as though it would let me reference column names, but every attempt to use it produced an error such as:
rake aborted!
undefined method `fetch_data' for #<ActiveRecord::ConnectionAdapters::IBM_DBAdapter:0x38c16c0>
What is wrong with this code, which was my first attempt:
x = exec_query("Select ... blah blah blah...")
x.each do |row|
  puts row['SNARK']
end
This was modeled on stuff documented at http://api.rubyonrails.org/classes/ActiveRecord/Result.html, but it failed with a similar exception that x.each is undefined.
Regardless, the notion that I have to resort to anything like IBM_DB.whathaveyou strikes me as obscene. Either IBM doesn't believe in making a standards conforming driver (not my experience using the Db2 driver for Java JDBC), or I just don't know what the supported, straightforward approach to this is.
Can someone tell me?
ruby -v returns ruby 1.9.3p545 (2014-02-24) [i386-mingw32].
What is wrong with x = exec_query("Select ... blah blah blah...")? If your query is not SQL-compliant, then you've tied yourself to that DBMS and will have a lot more work when you need to migrate to some other DBMS. That's the whole point of using an ORM: to separate the code from the query. It also means you'll need to stand up an instance of DB2 for development, test and production, because your code can't adjust for scenarios like using SQLite locally for development, MySQL or PostgreSQL in test, and DB2 in production. That's really easy with a good ORM.
Using or renaming IBM_DB isn't a significant issue in my experience. If you don't like calling it IBM_DB, assign that constant to another that is more visually pleasing. Sequel code typically uses DB but it's up to us what we want to call it. It's just a constant holding the connection information.
Having to index into an array is the result of using fetch_array. Typically, instead of using integers to identify fields, I'd define constants that are more symbolic/mnemonic and use them in place of the numbers. But, again, a good ORM makes it easy to use classes/models based on the table schema, adding layers of convenience and readability to isolate you from the uglier lower level driver's API.
Look at using Active Record or Sequel. Sequel supports DB2 nicely and is easily as powerful as Active Record and works with Rails. Both also work well in a non-Rails script; Sequel is my favorite for that but YMMV.
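As a small illustration with Sequel, a query like the one in the question could look like the sketch below. The ibmdb connection string, credentials, and lowercase column symbols are assumptions for illustration, not something taken from the question:
require 'sequel'

# Connect through the ibm_db gem's adapter; adjust user/password/host/database.
DB = Sequel.connect('ibmdb://db2inst1:secret@localhost:50000/SAMPLE')

# Rows come back as hashes keyed by column name, so no integer indices.
DB[:vita_xisk].where(:xiskid => 'WXYZ3529300').select(:snark, :gloopee).distinct.each do |row|
  puts "snark = #{row[:snark]}; gloopee = #{row[:gloopee]}"
end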

How to read value in cell from database

I am still fairly new to Ruby and to databases in general, and am trying to learn how to use the two together. I have browsed through several online tutorials but haven't been able to figure a few things out. I am working with PostgreSQL and am simply trying to read the data in my database and do something with the value contained in an individual cell. From a tutorial I have the following function:
def queryUserTable
  @conn.exec( "SELECT * FROM users" ) do |result|
    result.each do |row|
      yield row if block_given?
    end
  end
end
and a simple way to print out the information in the rows would be something like
p.queryUserTable {|row| printf("%s %s\n", row['first_name'], row['last_name'])}
(with p being the connection). However, all this does is print out each value in the specified row and column and then continue to the next row. What I would like to know is how I can grab, for instance, the value in row 1 under the column first_name and use it for something else. From what I understand, the rows are hashes, so I should be able to do something like {|row, value| @my_var = value }, but I get no results by doing so, so I am not understanding how this all works. I am hoping someone can better explain it. Hope that makes sense. Thanks!
EDIT:
Does it have anything to do with this line in my function?:
result.each do |row| #do I need to add |row,value| here as well?
Is there a reason you're not using an ORM like ActiveRecord? Although it certainly has some downsides, it may well be helpful for someone who is new to databases and Ruby. If you want a tutorial on Active Record and Rails, I highly recommend Michael Hartl's awesome free tutorial[1].
I'm not exactly sure what you're trying to do, but I can correct a couple of misconceptions. First of all, result is not a hash - it is an array of hashes. That is why result.each { |row, value| ... } doesn't initialize value. Once you have an individual row, you can do row.each { |col_name, val| ... }.
Second, if you want to grab a value from a specific row, you should specify the row in the query. You must know something about the row you want information about. For getting the user with id = 1, for instance:
user = @conn.exec("SELECT first_name FROM users WHERE id = 1").first
unless user.nil?
  # do something with user["first_name"]
end
If you were to use ActiveRecord, you could just do
user = User.find_by_id(1)
I would not want to set the value inside the queryUserTable loop, because it will get set on every iteration and just retain the value from the last time it executes.
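As a small illustration of grabbing one value directly out of the result instead of assigning it inside the block (the query and the variable names here are only for illustration):
rows = @conn.exec("SELECT first_name, last_name FROM users ORDER BY id")
first_name = rows[0]['first_name']   # the value in row 1, column first_name

rows.each do |row|
  row.each { |col_name, val| puts "#{col_name}: #{val}" }
end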
[1] https://www.railstutorial.org/book

RhoMobile 13,000 inserts causing issues due to time

I have a problem (due to time) when inserting around 13,000 records into the device's database.
Is there any way to optimize this? Is it possible to put these all into one transaction? I believe it is currently creating one transaction per insert, which apparently has a diabolical effect on speed.
Currently this takes around 10 minutes; that includes converting the CSV to a hash (which doesn't seem to be the bottleneck).
Stupidly I am not using RhoSync...
Thanks
Set up a transaction around the inserts and then only commit at the end.
From their FAQ.
http://docs.rhomobile.com/faq#how-can-i-seed-a-large-amount-of-data-into-my-application-with-rhom
db = ::Rho::RHO.get_src_db('Model')
db.start_transaction
begin
  items.each do |item|
    # create hash of attribute/value pairs
    data = {
      :field1 => item['value1'],
      :field2 => item['value2']
    }
    # Creates a new Model object and saves it
    new_item = Model.create(data)
  end
  db.commit
rescue
  db.rollback
end
I've found this technique to be a tremendous speed up.
Use fixed schema rather than property bag, and you can use one transaction (see the link below for how).
http://docs.rhomobile.com/rhodes/rhom#perfomance-tips.
This question was answered by someone else on Google Groups (HAYAKAWA Takashi).

Best way to convert a Mongo query to a Ruby array?

Let's say I have a large query (for the purposes of this exercise say it returns 1M records) in MongoDB, like:
users = Users.where(:last_name => 'Smith')
If I loop through this result, working with each member, with something like:
users.each do |user|
  # Some manipulation to "user"
  # Some calculation for "user"
  ...
  # Saving "user"
end
I'll often get a Mongo cursor timeout (as the reserved database cursor exceeds the default timeout length). I know I can extend the cursor timeout, or even turn it off, but this isn't always the most efficient approach. So one way I get around this is to change the code to:
users = Users.where(:last_name => 'Smith')
user_array = []
users.each do |u|
  user_array << u
end
THEN, I can loop through user_array (since it's a Ruby array), doing manipulations and calculations, without worrying about a MongoDB timeout.
This works fine, but there has to be a better way--does anyone have a suggestion?
If your result set is so large that it causes cursor timeouts, it's not a good idea to load it entirely to RAM.
A common approach is to process records in batches.
Get 1000 users (sorted by _id).
Process them.
Get another batch of 1000 users where _id is greater than _id of last processed user.
Repeat until done.
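A minimal sketch of that loop, assuming Mongoid and reusing the Users model from the question (the batch size and the save call are assumptions):
batch_size = 1000
last_id = nil

loop do
  criteria = Users.where(:last_name => 'Smith').asc(:_id).limit(batch_size)
  criteria = criteria.where(:_id.gt => last_id) if last_id
  batch = criteria.to_a          # each batch is a short-lived query, so no cursor timeout
  break if batch.empty?

  batch.each do |user|
    # manipulate, calculate, then save "user"
    user.save
  end

  last_id = batch.last.id
end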
For a long running task, consider using rails runner.
runner runs Ruby code in the context of Rails non-interactively. For instance:
$ rails runner "Model.long_running_method"
For further details, see:
http://guides.rubyonrails.org/command_line.html

Creating thousands of records in Rails

Let me set the stage: My application deals with gift cards. When we create cards they have to have a unique string that the user can use to redeem it with. So when someone orders our gift cards, like a retailer, we need to make a lot of new card objects and store them in the DB.
With that in mind, I'm trying to see how quickly I can have my application generate 100,000 cards. Database expert I am not, so I need someone to explain this little phenomenon: when I create 1,000 cards, it takes 5 seconds. When I create 100,000 cards, it should take 500 seconds, right?
Now I know what you're wanting to see: the card creation method I'm using, because the first assumption would be that it's getting slower as it goes along from checking the uniqueness of more and more cards. But I can show you my rake task:
desc "Creates cards for a retailer"
task :order_cards, [:number_of_cards, :value, :retailer_name] => :environment do |t, args|
t = Time.now
puts "Searching for retailer"
#retailer = Retailer.find_by_name(args[:retailer_name])
puts "Retailer found"
puts "Generating codes"
value = args[:value].to_i
number_of_cards = args[:number_of_cards].to_i
codes = []
top_off_codes(codes, number_of_cards)
while codes != codes.uniq
codes.uniq!
top_off_codes(codes, number_of_cards)
end
stored_codes = Card.all.collect do |c|
c.code
end
while codes != (codes - stored_codes)
codes -= stored_codes
top_off_codes(codes, number_of_cards)
end
puts "Codes are unique and generated"
puts "Creating bundle"
#bundle = #retailer.bundles.create!(:value => value)
puts "Bundle created"
puts "Creating cards"
#bundle.transaction do
codes.each do |code|
#bundle.cards.create!(:code => code)
end
end
puts "Cards generated in #{Time.now - t}s"
end
def top_off_codes(codes, intended_number)
(intended_number - codes.size).times do
codes << ReadableRandom.get(CODE_LENGTH)
end
end
I'm using a gem called readable_random for the unique code. So if you read through all of that code, you'll see that it does all of its uniqueness testing before it ever starts creating cards. It also writes status updates to the screen while it's running, and it always sits for a while at "Creating cards". Meanwhile it flies through the uniqueness tests. So my question to the Stack Overflow community is: why is my database slowing down as I add more cards? Why is this not a linear function with regard to time per card? I'm sure the answer is simple and I'm just a moron who knows nothing about data storage. And if anyone has any suggestions, how would you optimize this method, and how fast do you think you could get it to create 100,000 cards?
(When I plotted my times on a graph and did a quick curve fit to get my line formula, I calculated how long it would take to create 100,000 cards with my current code, and it says 5.5 hours. That may be completely wrong, I'm not sure, but if it stays on the curve I fitted, it would be right around there.)
Not an answer to your question, but a couple of suggestions on how to make the insert faster:
Use Ruby's Hash to eliminate duplicates: using your card codes as hash keys, add them to a hash until it grows to the desired size. You can also use the Set class instead (but I doubt it's any faster than Hash); see the sketch after the next point.
Use a bulk insert into the database instead of a series of INSERT queries. Most DBMSs offer this possibility: create a text file with the new records and tell the database to import it. Here are links for MySQL and PostgreSQL.
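A minimal sketch of the first suggestion, reusing ReadableRandom and CODE_LENGTH from the rake task above (the target count of 100,000 is just an example):
require 'set'

codes = Set.new
codes << ReadableRandom.get(CODE_LENGTH) while codes.size < 100_000
codes = codes.to_a   # duplicates were dropped on insertion, so these are unique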
My first thoughts would be around transactions - if you have 100,000 pending changes waiting to be committed in the transaction that would slow things down a little, but any decent DB should be able to handle that.
What DB are you using?
What indexes are in place?
Any DB optimisations, e.g. clustered tables/indexes?
Not sure of the Ruby transaction support - is the @bundle.transaction line something from ActiveModel or another library you are using?
