Is there a way to do the equivalent of ActiveRecord#find_each in DataMapper ?
(find_each will iterate over the result of a query by fetching things in memory by batch of 1000 rather than loading everything in memory)
I checked dm-chunked_query as @MichaelKohl suggested, but I couldn't make it work as I'd expect: it fetches the whole collection (I'd expect it to use OFFSET + LIMIT). So I wrote my own extension; it's pretty simple, hope it helps:
class DataMapper::Collection
  # Lazily yields records in batches of n, using slice (which maps to
  # OFFSET/LIMIT) so only one batch is held in memory at a time.
  def batch(n)
    Enumerator.new do |y|
      offset = 0
      loop do
        records = slice(offset, n)
        break if records.empty?
        records.each { |record| y.yield(record) }
        offset += records.size
      end
    end
  end
end
# Example
Model.all(:order => :id.asc).batch(1000).each { |obj| p obj }
I don't use DM much, but it wouldn't be that hard to write your own, assuming DM lets you apply your own 'limit' and 'offset' manually to queries.
Check out the implementation of find_each/find_in_batches in AR; it's only a couple dozen lines.
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L19
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L48
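For example, here is a rough sketch of how the AR approach might translate to DataMapper (untested; it assumes your adapter honors the :limit and :offset query options, and find_in_batches is just a name I made up, not a real DM API):

# Sketch of an AR-style find_in_batches for DataMapper (untested).
# Assumes the adapter honors :limit/:offset.
def find_in_batches(model, batch_size = 1000)
  offset = 0
  loop do
    batch = model.all(:limit => batch_size, :offset => offset, :order => :id.asc)
    break if batch.empty? # an empty page means we've run out of rows
    batch.each { |record| yield record }
    offset += batch_size
  end
end

# find_in_batches(Author, 1000) { |author| puts author.title }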
Related
I was trying to make my bubble sort shorter and I came up with this:
class Array
  def bubble_sort!(&block)
    block = Proc.new { |a, b| a <=> b } unless block_given?
    sorted = each_index.each_cons(2).none? do |i, next_i|
      if block.call(self[i], self[next_i]) == 1
        self[i], self[next_i] = self[next_i], self[i]
      end
    end until sorted
    self
  end

  def bubble_sort(&prc)
    self.dup.bubble_sort!(&prc)
  end
end
I don't particularly like the sorted = <sort code> until sorted construction.
I just want to run the each_index.each_cons(2).none? code until it returns true. It's a weird situation in that I'm using until, but the condition is the code I want to run. Anyway, my attempt seems awkward, and Ruby usually has a nice concise way of putting things. Is there a better way to do this?
This is just my opinion.
Have you ever read the Ruby source code of each and map to understand what they do?
No, because they have a clear task expressed by the method name, and if you test them, they will take an object and some parameters and then return a value to you.
For example, if I want to test the String method split():
s = "a new string"
s.split("new")
=> ["a ", " string"]
Do you know if .split() takes a block?
It is one of the core Ruby methods, but I don't pass it a block 90% of the time; I can understand what it does from the name .split() and from the return value.
Focus on the objects you are using, the tasks their methods should accomplish, and their return values.
I read your code and I cannot refactor it; I can hardly understand what the code does.
I decided to write down some points, with the possibility to follow up:
1) Do not use the proc for now; first get the object-oriented code clean.
2) Split bubble_sort! into several methods, each one with a clear task: e.g. def ordered_inverted! (the current bubble_sort!) and def invert_values, perhaps performing invert_values until sorted (see the sketch after this list). Also check whether existing methods already provide this sorting functionality.
3) Write specs for those methods; TDD will push you to keep methods simple and easy to test.
4) If those methods do not belong in the Array class, include them in the appropriate class; sometimes overly complicated methods are just performing simple String operations.
5) Reading books about refactoring may actually help more than trying to force the usage of procs and functional programming when not necessary.
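For example, a minimal sketch of what point 2 might look like (the helper names sorted? and swap_adjacent! are invented here for illustration, not existing methods):

class Array
  # Sketch only: sorted? and swap_adjacent! are hypothetical helper names.
  def bubble_sort!
    swap_adjacent! until sorted?
    self
  end

  def sorted?
    each_cons(2).all? { |a, b| (a <=> b) <= 0 }
  end

  def swap_adjacent!
    each_index.each_cons(2) do |i, j|
      self[i], self[j] = self[j], self[i] if (self[i] <=> self[j]) > 0
    end
  end
end

Each method now has one clear task, and each is trivial to spec on its own.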
After looking into it further, I'm fairly sure the best solution is:
loop do
  break if condition
end
Either that or the way I have it in the question, but I think the loop do version is clearer.
Edit:
Ha, a couple of weeks after I settled on the loop do solution, I stumbled upon a better one. You can just use a while or until loop with an empty body, like this:
while condition; end
until condition; end
So the bubble sort example in the question can be written like this:
class Array
  def bubble_sort!(&block)
    block = Proc.new { |a, b| a <=> b } unless block_given?
    until (each_index.each_cons(2).none? do |i, next_i|
      if block.call(self[i], self[next_i]) == 1
        self[i], self[next_i] = self[next_i], self[i]
      end
    end); end
    self
  end

  def bubble_sort(&prc)
    self.dup.bubble_sort!(&prc)
  end
end
I have a big Array of AR model instances. Let's say there are 20K entries in the array. I want to move through that array a chunk of 1,000 at a time.
slice_size = 1000
start = 0
myarray.slice(start, slice_size) do |slice|
  slice.each do |item|
    item.dostuff
  end
  start += slice_size
end
I can replace that whole inner block with just:
puts "hey"
and not see a thing in the console. I have tried this nine ways from Sunday. And I've done it successfully before; I just can't remember where. And I have RTFM. Can anyone help?
The problem is that slice does not take a block; you are passing it one, and whatever you do inside it is simply ignored. If you do
myarray.slice(start, slice_size).each do |item|
  ...
end
then it should work.
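As with any Ruby method that never yields, the block you pass to slice is accepted but silently ignored, which you can verify in isolation:

[1, 2, 3, 4].slice(0, 2) { puts "hey" } # prints nothing; the block is ignored
# => [1, 2]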
But to do it that way is not Ruby-ish. A better way is
myarray.each_slice(slice_size) do |slice|
  ...
end
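In miniature, with a slice size of 2:

[1, 2, 3, 4, 5].each_slice(2) { |slice| p slice }
# [1, 2]
# [3, 4]
# [5]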
If the array can be destroyed, you could do it like this:
((myarray.size + slice_size - 1) / slice_size).times.map { myarray.shift(slice_size) }
If not:
((myarray.size + slice_size - 1) / slice_size).times.map { |i|
  myarray.slice(i * slice_size, slice_size)
}
You can use:
Enumerable#each_slice(n), which takes n items at a time;
Array#in_groups_of(n) (if this is Rails), which works like each_slice but will pad the last group to guarantee the group size remains constant (see the example below);
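For example, here is the padding difference between the two:

[1, 2, 3, 4, 5].each_slice(2).to_a # => [[1, 2], [3, 4], [5]]
[1, 2, 3, 4, 5].in_groups_of(2)    # => [[1, 2], [3, 4], [5, nil]]  (Rails only)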
But I recommend using ActiveRecord's built-in Model.find_each, which batches queries in the DB layer for better performance. It defaults to batches of 1000, but you can specify the batch size. See http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches for more detail.
Example from the guide:
User.find_each(batch_size: 5000) do |user|
  NewsLetter.weekly_deliver(user)
end
In Ruby, I have an array of simple values (possible encodings):
encodings = %w[ utf-8 iso-8859-1 macroman ]
I want to keep reading a file from disk until the results are valid. I could do this:
good = encodings.find { |enc| IO.read(file, mode: "r:#{enc}").valid_encoding? }
contents = IO.read(file, mode: "r:#{good}")
...but of course this is dumb, since it reads the file twice for the good encoding. I could program it in gross procedural style like so:
contents = nil
encodings.each do |enc|
  if (s = IO.read(file, mode: "r:#{enc}")).valid_encoding?
    contents = s
    break
  end
end
But I want a functional solution. I could do it functionally like so:
contents = encodings.map { |e| IO.read(file, mode: "r:#{e}") }.find { |s| s.valid_encoding? }
…but of course that keeps reading files for every encoding, even if the first was valid.
Is there a simple pattern that is functional, but does not keep reading the file after the first success is found?
If you sprinkle a lazy in there, map will only consume those elements of the array that are used by find - i.e. once find stops, map stops as well. So this will do what you want:
possible_reads = encodings.lazy.map { |e| IO.read(file, mode: "r:#{e}") }
contents = possible_reads.find { |s| s.valid_encoding? }
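You can see the short-circuiting in isolation with a side effect (a toy example, not the file-reading code):

reads = %w[a b c].lazy.map { |x| puts "reading #{x}"; x }
reads.find { |x| x == "b" }
# prints "reading a" and "reading b", but never "reading c"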
Hopping on sepp2k's answer: If you can't use 2.0, lazy enums can be easily implemented in 1.9:
class Enumerator
  def lazy_find
    self.class.new do |yielder|
      self.each do |element|
        if yield(element)
          yielder.yield(element)
          break
        end
      end
    end
  end
end

a = (1..100).to_enum
p a.lazy_find { |i| i.even? }.first
# => 2
You want to use the break statement:
contents = encodings.each do |e|
  s = IO.read(file, mode: "r:#{e}")
  s.valid_encoding? and break s
end
The best I can come up with is with our good friend inject:
contents = encodings.inject(nil) do |s, enc|
  s || ((c = IO.read(file, mode: "r:#{enc}")).valid_encoding? && c)
end
This is still sub-optimal because it continues to loop through encodings after finding a match, though it doesn't do anything with them, so it's a minor ugliness. Most of the ugliness comes from...well, the code itself. :/
What's the most efficient way to iterate through an entire table using DataMapper?
If I do this, does DataMapper try to pull the entire result set into memory before performing the iteration? Assume, for the sake of argument, that I have millions of records and that this is infeasible:
Author.all.each do |a|
  puts a.title
end
Is there a way that I can tell DataMapper to load the results in chunks? Is it smart enough to know to do this automatically?
Thanks, Nicolas, I actually came up with a similar solution. I've accepted your answer since it makes use of DataMapper's dm-pagination system, but I'm wondering if this would do equally as well (or worse):
offset = 0
while (authors = Author.slice(offset, CHUNK)) do
  authors.each do |a|
    # do something with a
  end
  offset += CHUNK
end
DataMapper will run just one SQL query for the example above, so it will have to keep the whole result set in memory.
I think you should use some sort of pagination if your collection is big.
Using dm-pagination you could do something like:
PAGE_SIZE = 20
pager = Author.page(:per_page => PAGE_SIZE).pager # This will run a count query
(1..pager.total_pages).each do |page_number|
  Author.page(:per_page => PAGE_SIZE, :page => page_number).each do |a|
    puts a.title
  end
end
You can play around with different values for PAGE_SIZE to find a good trade-off between the number of SQL queries and memory usage.
What you want is the dm-chunked_query plugin (example from the docs):
require 'dm-chunked_query'
MyModel.each_chunk(20) do |chunk|
  chunk.each do |resource|
    # ...
  end
end
This will allow you to iterate over all the records in the model, in chunks of 20 records at a time.
EDIT: the example above originally had an extra #each after #each_chunk, which was unnecessary. The gem author updated the README example, and I changed the above code to match.
I'm just starting out using Ruby and I've written a bit of code to do basic parsing of a CSV file (Line is a basic class, omitted for brevity):
class File
  def each_csv
    each do |line|
      yield line.split(",")
    end
  end
end

lines = Array.new
File.open("some.csv") do |file|
  file.each_csv do |csv|
    lines << Line.new(:field1 => csv[0], :field2 => csv[1])
  end
end
I have a feeling I would be better off using collect somehow rather than pushing each Line onto the array, but I can't work out how to do it.
Can anyone show me how to do it or is it perfectly fine as it is?
Edit: I should have made it clear that I'm not actually going to use this code in production, it's more to get used to the constructs of the language. It is still useful to know there are libraries to do this properly though.
Here's a (possibly wild) idea: use the Struct class instead of rolling your own simple POD class. What you want from this is a constructor that accepts all of the arguments that can be generated from the file data.
Line = Struct.new(:field1, :field2, :field3)
Then at the core of the algorithm you want something like:
File.open("test.csv").lines.inject([]) do |result, line|
result << Line.new(line.split(",", Line.length))
end
or being a bit more concise and functional-like:
lines = File.foreach("test.csv").map { |line| Line.new(*line.chomp.split(",", Line.members.size)) }
To be honest I haven't used the Struct class much, but I should, and I will probably refactor stuff already written to use it. It allows you to access the members by name on an instance:
line = Line.new
line.field1 = "blah"
line.field2 = 1
The Ruby Struct class.
So, to actually answer your question: looking at the code above, I would say it is much simpler to use collect/map to perform the computation. The map function together with inject are very powerful, and I find I use them quite frequently.
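As a trivial illustration of combining the two:

(1..4).map { |n| n * n }.inject(0) { |sum, sq| sum + sq } # => 30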
I don't know if you are aware of it, but Ruby has its own class for parsing and writing CSV files.
Here is an example I found of using collect to turn a CSV file into an array of hashes:
def csv_to_array(file_location)
  csv = CSV.parse(File.open(file_location, 'r') { |f| f.read })
  fields = csv.shift # the first row holds the field names
  # Build one hash per record, pairing each field name with its value
  csv.collect do |record|
    Hash[*(0..(fields.length - 1)).collect { |index| [fields[index], record[index].to_s] }.flatten]
  end
end
This example is taken from this article.
If you are unfamiliar with the * notation, it basically dissolves the outer [] brackets, turning an array into a comma-separated list of its elements.
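For example:

pairs = [["a", 1], ["b", 2]]
Hash[*pairs.flatten] # => {"a"=>1, "b"=>2}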
Have you looked at FasterCSV? It does what you're trying to do here, along with dealing with some of the brain-deadness you find in some CSV files.
See how this works for you (functional programming is fun!):
Try using inject. It takes as a parameter the starting "accumulator" value, and then a two-parameter block:
[1, 2, 3].inject(0) { |sum, num| sum + num } # => 6
[1, 2, 3].inject(5) { |sum, num| sum + num } # => 11
[1, 2, 3].inject(2) { |sum, num| sum * num } # => 12
To the point:
class Line
  def initialize(options)
    @options = options
  end

  def to_s
    @options[:field1] + " " + @options[:field2]
  end
end

File.foreach("test.csv").inject([]) do |lines, line|
  split = line.split(",")
  lines << Line.new(:field1 => split[0], :field2 => split[1])
end