Ruby future buffer?

I want to fetch a bunch of XML documents and parse them. They are somewhat large.
So I was thinking I could fetch and parse them in futures like this (I currently use Celluloid):
country_xml = {}
country_pool = GetAndParseXML.pool size: 4, args: [@connection]

countries.each do |country|
  country_xml[country] = country_pool.future.fetch_xml country
end

countries.each do |country|
  xml = country_xml[country].value
  # Do stuff with the XML!
end
This would be fine, except that it takes up a lot of memory before the XML is actually needed.
Ideally I want it to buffer up, say, 3 XML files, then stop and wait until at least one is processed before continuing. How would I do that?

The first question is: what is it that's taking up the memory? I will assume it's the parsed XML documents, as that seems most likely to me.
I think the easiest way would be to create an actor that will fetch and process the XML. If you then create a pool of 3 of these actors you will have at most 3 requests being processed at once.
In vague terms (assuming that you aren't using the Celluloid registry):
class DoStuffWithCountryXml
  include Celluloid
  exclusive :do_stuff_with_country

  def initialize(fetcher)
    @fetcher = fetcher
  end

  def do_stuff_with_country(country)
    country_xml = @fetcher.fetch_xml country
    # Do stuff with country_xml
  end
end

country_pool = GetAndParseXML.pool size: 4, args: [@connection]
country_process_pool = DoStuffWithCountryXml.pool size: 3, args: [country_pool]

countries_futures = countries.map { |c| country_process_pool.future.do_stuff_with_country(c) }
countries_stuff = countries_futures.map { |f| f.value }
Note that if this is the only place where GetAndParseXML is used, then its pool size might as well be the same as the DoStuffWithCountryXml pool's.

I would not use a Pool at all. You're not benefiting from it. A lot of people seem to feel using a Future and a Pool together is a good idea, but it's usually worse than using one or the other.
In your case, use Future... but you will also benefit from the upcoming Multiplexer features. Until then, use a totally different strategy than has been tried or suggested so far:
class HandleXML
  include Celluloid

  def initialize(fetcher)
    @fetcher = fetcher
  end

  def get_xml(country)
    @fetcher.fetch_xml(country)
  end

  def process_xml(country, xml)
    # Do whatever you need to do with the data.
  end
end

def begin_processor(handler, countries, index)
  handler.future.get_xml(countries[index])
end

limiter = 3 # This sets your desired limit.
country_index = 0
data_index = 0
data = {}
processing = []
handler = HandleXML.new(@connection)

# Load up your initial futures.
limiter.times {
  processing << begin_processor(handler, countries, country_index)
  country_index += 1
}

while data_index < countries.length
  data[countries[data_index]] = processing.shift.value
  handler.process_xml(countries[data_index], data[countries[data_index]])
  data_index += 1
  # Once you've taken one XML set out above, load up another.
  if country_index < countries.length
    processing << begin_processor(handler, countries, country_index)
    country_index += 1
  end
end
The above is just an example of how to do it with Future only, handling 3 at a time. I've not run it and it could have errors, but the idea is demonstrated for you.
The code loads up 3 sets of country XML, then starts processing that XML. Once it has processed one set, it loads up another, until all the country XML is processed.

Related

Dynamic Nested Ruby Loops

So, what I'm trying to do is make calls to a Reporting API to filter by all possible breakdowns (break the reports down by site, advertiser, ad type, campaign, etc.). But one issue is that the breakdowns can be unique to each login.
Example:
user1: alice123's reporting breakdowns are ["site","advertiser","ad_type","campaign","line_items"]
user2: bob789's reporting breakdowns are ["campaign","position","line_items"]
When I first built the code for this reporting API, I only had one login to test with, so I hard-coded the loops for the dimensions (["site","advertiser","ad_type","campaign","line_items"]). What I did was ping the API for a report by sites; then for each site, ping for advertisers; then for each advertiser, ping for the next dimension, and so on, leaving me with a nested loop of ~6 layers.
Basically, what I'm doing:
sites = mechanize.get "#{base_url}/report?dim=sites"
sites = Yajl::Parser.parse(sites.body) # json parser
sites.each do |site|
  advertisers = mechanize.get "#{base_url}/report?site=#{site.fetch("id")}&dim=advertiser"
  advertisers = Yajl::Parser.parse(advertisers.body) # json parser
  advertisers.each do |advertiser|
    ad_types = mechanize.get "#{base_url}/report?site=#{site.fetch("id")}&advertiser=#{advertiser.fetch("id")}&dim=ad_type"
    ad_types = Yajl::Parser.parse(ad_types.body) # json parser
    ad_types.each do |ad_type|
      # ...and so on...
    end
  end
end
GET <api_url>/?dim=<dimension to breakdown>&site=<filter by site id>&advertiser=<filter by advertiser id>...etc...
At the end of the nested loop, I'm left with a report that's broken down with as much granularity as possible.
This worked when I assumed there was only one way to break things down, but apparently each account can have different breakdown dimensions.
So what I'm asking is: given an array of breakdowns, how can I set up a nested loop that dynamically traverses down to the finest level of granularity?
Thanks.
I'm not sure what your JSON/GET returns exactly but for a problem like this you would need recursion.
Something like this perhaps? It's not very elegant and can definitely be optimised further but should hopefully give you an idea.
some_hash = {:id=>"site-id", :body=>{:id=>"advertiser-id", :body=>{:id=>"ad_type-id", :body=>{:id=>"something-id"}}}}
@breakdowns = ["site", "advertiser", "ad_type", "something"]

def recursive(some_hash, str = nil, i = 0)
  if @breakdowns[i + 1].nil?
    str += "#{@breakdowns[i]}=#{some_hash[:id]}"
  else
    str += "#{@breakdowns[i]}=#{some_hash[:id]}&dim=#{@breakdowns[i + 1]}"
  end
  p str
  some_hash[:body].is_a?(Hash) ? recursive(some_hash[:body], str.gsub(/dim.*/, ''), i + 1) : return
end

recursive(some_hash, 'base-url/report?')
=> "base-url/report?site=site-id&dim=advertiser"
=> "base-url/report?site=site-id&advertiser=advertiser-id&dim=ad_type"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&dim=something"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&something=something-id"
If you are just looking to map your data, you can recursively map to a hash as another user pointed out. If you are actually looking to do something with this data while within the loop and want to dynamically recreate the loop structure you listed in your question (though I would advise coming up with a different solution), you can use metaprogramming as follows:
require 'active_support/inflector'

# Assume we are given an input of breakdowns.
# I put 'testarr' in place of the operations you perform on each local variable
# for brevity and so you can see that the code works.
# You will have to modify to suit your needs.
result = []
testarr = [1, 2, 3]
b = binding

breakdowns.each do |breakdown|
  snippet = <<-END
    eval("#{breakdown.pluralize} = testarr", b)
    eval("#{breakdown.pluralize}", b).each do |#{breakdown}|
  END
  result << snippet
end

result << "end\n" * breakdowns.length
eval(result.join)
Note: This method is probably frowned upon, and as I've said I'm sure there are other methods of accomplishing what you are trying to do.
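If you do want to avoid eval, a plain recursive walk over the breakdown array can drive the same drill-down without metaprogramming. This is only a rough sketch: fetch_level is a hypothetical helper standing in for whatever API call returns the rows for a given dimension and set of filters.
def drill_down(breakdowns, filters = {}, &block)
  # Nothing left to drill into: hand the accumulated filters to the caller.
  return yield(filters) if breakdowns.empty?

  dimension, *rest = breakdowns
  # fetch_level is assumed, not part of the original code: it should return
  # the parsed rows for `dimension` restricted by the filters built so far.
  fetch_level(dimension, filters).each do |row|
    drill_down(rest, filters.merge(dimension => row.fetch("id")), &block)
  end
end

drill_down(["campaign", "position", "line_items"]) do |filters|
  # filters is e.g. {"campaign"=>"c-1", "position"=>"p-2", "line_items"=>"l-3"};
  # fetch or store the fully filtered report for this leaf here.
end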

Can I safely use multiple mongo cursors at once in ruby?

Can you tell me if the following code is safe and whether there is any risk that some documents may be missed?
The idea behind the code is to process multiple data sets (obtained for multiple queries), but spend at most n seconds in each collection.
Example: for two sets of data [1,2,3,4], [A,B,C,D], lets say each item processing time is 5 seconds and took_too_much_time returns true every 10 secs, I want to process 1,2, then A,B, then 3,4, then C,D.
On a small amount of data everything works well, but after switching to a bigger set (tens of queries and thousands of documents), I'm missing some documents. It would be really great if you could confirm or deny that the code below may be the reason.
cursors = []

queries.each do |query|
  collection.find(query, {timeout: false}) { |c| cursors << c }
end

while cursors.any?
  cursors.each do |c|
    c.each do |document|
      process_document(document)
      break if took_too_much_time
    end
  end
  cursors = remove_empty_cursors(cursors)
end

Ruby/Builder API: create XML without using blocks

I'd like to use Builder to construct a set of XML files based on a table of ActiveRecord models. I have nearly a million rows, so I need to use find_each(batch_size: 5000) to iterate over the records and write an XML file for each batch of them, until the records are exhausted. Something like the following:
filecount = 1
count = 0
xml = ""

Person.find_each(batch_size: 5000) do |person|
  xml += person.to_xml # pretend .to_xml() exists
  count += 1
  if count == MAX_PER_FILE
    File.open("#{filecount}.xml", 'w') { |f| f.write(xml) }
    xml = ""
    filecount += 1
    count = 0
  end
end
This doesn't work well with Builder's interface, as it wants to work in blocks, like so:
xml = builder.person { |p| p.name("Jim") }
Once the block ends, Builder closes its current stanza; you can't keep a reference to p and use it outside of the block (I tried). Basically, Builder wants to "own" the iteration.
So to make this work with builder, I'd have to do something like:
filecount = 0
offset = 0

while offset < Person.count do
  count = 0
  builder = Builder::XmlMarkup.new(indent: 5)
  xml = builder.people do |people|
    Person.limit(MAX_PER_FILE).offset(offset).each do |person|
      people.person { |p| p.name(person.name) }
      count += 1
    end
  end
  File.open("#{filecount}.xml", 'w') { |f| f.write(xml) }
  filecount += 1
  offset += count
end
Is there a way to use Builder without the block syntax? Is there a way to programmatically tell it "close the current stanza" rather than relying on a block?
My suggestion: don't use builder.
XML is a simple format as long as you escape the xml entities correctly.
Batch your DB retrieval, then just write out each batch as XML to a file handle. Don't buffer via a string as your example shows; just write to the file handle and let the OS deal with buffering. Files can be of any size, so why the limit?
Also, don't include the indentation spaces; with a million rows, they'd add up.
Added
When writing xml files, I also include xml comments at the top of the file:
The name of the software and version that generated the xml file
Date / timestamp the file was written
Other useful info, e.g. in this case you could say that the file is batch #x of the original data set.
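A minimal sketch of that kind of header, assuming file is an already opened File handle (the generator name and batch numbers are purely illustrative):
require 'time'

# Illustrative values only; substitute your own generator name and batch info.
file.puts "<?xml version='1.0' encoding='UTF-8'?>"
file.puts "<!-- Generated by people-exporter 1.2.3 -->"
file.puts "<!-- Written at #{Time.now.utc.iso8601} -->"
file.puts "<!-- Batch 3 of 42 of the original data set -->"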
I ended up generating the XML manually, as per Larry K's suggestion. Ruby's built-in XML encoding made this a piece of cake. I'm not sure why this feature is not more widely advertised... I wasted a lot of time Googling and trying various to_xs implementations before I stumbled upon the built-in "foo".encode(xml: :text).
My code now looks like:
def run
  count = 0
  Person.find_each(batch_size: 5000) do |person|
    open_new_file if @current_file.nil?
    # Simplified: I actually have many more fields and elements
    @current_file.puts "  <person>#{person.name.encode(xml: :text)}</person>"
    count += 1
    if count == MAX_PER_FILE
      close_current_file
      count = 0
    end
  end
  close_current_file
end

def open_new_file
  @file_count += 1
  @current_file = File.open("people#{@file_count}.xml", 'w')
  @current_file.puts "<?xml version='1.0' encoding='UTF-8'?>"
  @current_file.puts "<people>"
end

def close_current_file
  unless @current_file.nil?
    @current_file.puts "</people>"
    @current_file.close
    @current_file = nil
  end
end

ruby - can't create Thread (35) (ThreadError)

I'm very new to Ruby and I'm learning how to do processing in multiple threads. What I do is parse a 170 MB XML file using Nokogiri, and I put the database (PostgreSQL) insert inside a new thread inside my .each(). Please suggest a better approach for handling this very large file and doing it in multiple threads. Here's what I have so far.
conn = PGconn.connect("localhost", 5432, "", "", "oaxis", "postgres", "root")
f = File.open("metadata.xml")
doc = Nokogiri::XML(f)
counter = 0
threadArray = []

doc.xpath('//Title').each do |node|
  threadArray[counter] = Thread.new {
    titleVal = node.text
    random_string = (0...10).map { ('a'..'z').to_a[rand(26)] }.join
    conn.prepare('ins' + random_string, 'insert into sample_tbl (title) values ($1)')
    conn.exec_prepared('ins' + random_string, [titleVal])
    puts titleVal + " ==>" + random_string + " \n"
    counter += 1
  }
end

threadArray.each { |t| t.join }
f.close
What you are doing will not result in the data being inserted into the database any faster than in the single-threaded case. MRI Ruby has a global interpreter lock and will only ever run a single thread at a time. Using threads in MRI Ruby only improves performance when the threads are performing IO actions (or waiting to be able to do so) and program progress does not depend on the results of those IO actions (so you don't actively wait for them).
I advise you to stop using threads here and instead calculate all the values you wish to insert and then mass insert them. The code will also be simpler to understand and reason about. Even inserting them one by one from a single thread will be faster, but there's no reason to do that.
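A minimal sketch of that single-threaded, batched approach, assuming the same conn connection and doc Nokogiri document from the question (the batch size of 500 is arbitrary):
# Collect all the titles first, then insert them in multi-row batches.
titles = doc.xpath('//Title').map(&:text)

titles.each_slice(500) do |batch|
  # Build "($1), ($2), ..." placeholders for one multi-row INSERT.
  placeholders = (1..batch.size).map { |i| "($#{i})" }.join(', ')
  conn.exec_params("insert into sample_tbl (title) values #{placeholders}", batch)
end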

Database locking: ActiveRecord + Heroku

I'm building a Sinatra-based app for deployment on Heroku. You can imagine it like a standard URL shortener, but where old shortcodes expire and become available for new URLs (I realise this is a silly concept, but it's easier to explain this way). I'm representing the shortcode in my database as an integer and redefining its reader to give a nice short and unique string from the integer.
As some rows will be deleted, I've written code that goes through all the shortcode integers and picks the first free one to use, just before_save. Unfortunately, I can make my code create two rows with identical shortcode integers if I run two instances very quickly one after another, which is obviously no good! How should I implement a locking system so that I can quickly save my record with a unique shortcode integer?
Here's what I have so far:
Chars = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
CharLength = Chars.length

class Shorts < ActiveRecord::Base
  before_save :gen_shortcode
  after_save :done_shortcode

  def shortcode
    i = read_attribute(:shortcode).to_i
    return '0' if i == 0
    s = ''
    while i > 0
      s << Chars[i.modulo(CharLength)]
      i /= CharLength
    end
    s
  end

  private

  def gen_shortcode
    shortcode = 0
    self.class.find(:all, :order => "shortcode ASC").each do |s|
      if s.read_attribute(:shortcode).to_i != shortcode
        # Begin locking?
        break
      end
      shortcode += 1
    end
    write_attribute(:shortcode, shortcode)
  end

  def done_shortcode
    # End Locking?
  end
end
This line:
self.class.find(:all,:order=>"shortcode ASC").each
will do a sequential search over your entire record collection. You'd have to lock the entire table so that, when one of your processes is scanning for the next integer, the others will wait for the first one to finish. This will be a performance killer. My suggestion, if possible, is for the process to be as follows:
Add a column that indicates when a record has expired (do you expire them by time of creation? last use?). Index this column.
When you need to find the next lowest usable number, do something like
Shorts.find(:first, :conditions => { :expired => true }, :order => 'shortcode')
This will have the database doing the hard work of finding the lowest expired shortcode. Recall that, with :first instead of :all, the find method returns only the first matching record.
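For the first step, a migration along these lines could add and index the expired flag (the column, index, and table names here are just a guess based on the model above, not from the original answer):
class AddExpiredToShorts < ActiveRecord::Migration
  def self.up
    add_column :shorts, :expired, :boolean, :default => false
    add_index :shorts, :expired
  end

  def self.down
    remove_index :shorts, :expired
    remove_column :shorts, :expired
  end
end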
Now, in order to prevent race conditions between processes, you can wrap this in a transaction and lock while doing the search:
Shorts.transaction do
  short = Shorts.find(:first, :conditions => { :expired => true }, :order => 'shortcode', :lock => true)
  # Do your thing here. Be quick about it; the row is locked while you work.
end # on ending the transaction the lock is released
Now when a second process starts looking for a free shortcode, it will not read the one that's locked (so presumably it will find the next one). This is because the :lock => true parameter gets an exclusive lock (both read/write).
Check this guide for more on locking with ActiveRecord.
