Ruby/Builder API: create XML without using blocks

I'd like to use Builder to construct a set of XML files based on a table of ActiveRecord models. I have nearly a million rows, so I need to use find_each(batch_size: 5000) to iterate over the records and write an XML file for each batch of them, until the records are exhausted. Something like the following:
filecount = 1
count = 0
xml = ""

Person.find_each(batch_size: 5000) do |person|
  xml += person.to_xml # pretend .to_xml() exists
  count += 1

  if count == MAX_PER_FILE
    File.open("#{filecount}.xml", 'w') { |f| f.write(xml) }
    xml = ""
    filecount += 1
    count = 0
  end
end
This doesn't work well with Builder's interface, as it wants to work in blocks, like so:
xml = builder.person { |p| p.name("Jim") }
Once the block ends, Builder closes its current stanza; you can't keep a reference to p and use it outside of the block (I tried). Basically, Builder wants to "own" the iteration.
So to make this work with builder, I'd have to do something like:
filecount = 0
offset = 0

while offset < Person.count do
  count = 0
  builder = Builder::XmlMarkup.new(indent: 5)

  xml = builder.people do |people|
    Person.limit(MAX_PER_FILE).offset(offset).each do |person|
      people.person { |p| p.name(person.name) }
      count += 1
    end
  end

  File.open("output#{filecount}.xml", 'w') { |f| f.write(xml) }
  filecount += 1
  offset += count
end
Is there a way to use Builder without the block syntax? Is there a way to programmatically tell it "close the current stanza" rather than relying on a block?

My suggestion: don't use builder.
XML is a simple format as long as you escape the XML entities correctly.
Batch your DB retrieval, then just write each batch out as XML to a file handle. Don't buffer via a string as your example shows; just write to the file handle and let the OS deal with buffering. Files can be of any size, so why the limit?
Also, don't include the indentation spaces; with a million rows they'd add up.
Added
When writing XML files, I also include XML comments at the top of the file (see the sketch after this list):
The name of the software and version that generated the XML file
Date / timestamp the file was written
Other useful info, e.g. in this case you could note that the file is batch #x of the original data set.
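For concreteness, a minimal sketch of such a header written straight to the file handle; the generator name and version are made up, and filecount comes from the question's own batching loop:
require 'time'

filecount = 1 # from the question's batching loop
file = File.open("#{filecount}.xml", 'w')
file.puts "<?xml version='1.0' encoding='UTF-8'?>"
file.puts "<!-- Generated by my-exporter 1.2.3 (hypothetical name/version) -->"
file.puts "<!-- Written at #{Time.now.utc.iso8601} -->"
file.puts "<!-- Batch #{filecount} of the original data set -->"
file.puts "<people>"
# ... write escaped <person> rows here, then "</people>" and file.close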

I ended up generating the XML manually, as per Larry K's suggestion. Ruby's built-in XML encoding made this a piece of cake. I'm not sure why this feature is not more widely advertised... I wasted a lot of time Googling and trying various to_xs implementations before I stumbled upon the built-in "foo".encode(xml: :text).
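For example, the built-in escaping turns markup-significant characters into entities:
"Ben & Jerry's <Chunky Monkey>".encode(xml: :text)
# => "Ben &amp; Jerry's &lt;Chunky Monkey&gt;"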
My code now looks like:
def run
  count = 0

  Person.find_each(batch_size: 5000) do |person|
    open_new_file if @current_file.nil?

    # simplified - I actually have many more fields and elements
    @current_file.puts "  <person>#{person.name.encode(xml: :text)}</person>"

    count += 1
    if count == MAX_PER_FILE
      close_current_file
      count = 0
    end
  end

  close_current_file
end

def open_new_file
  # @file_count is assumed to be initialized to 0 elsewhere
  @file_count += 1
  @current_file = File.open("people#{@file_count}.xml", 'w')
  @current_file.puts "<?xml version='1.0' encoding='UTF-8'?>"
  @current_file.puts "<people>"
end

def close_current_file
  unless @current_file.nil?
    @current_file.puts "</people>"
    @current_file.close
    @current_file = nil
  end
end

Related

In Ruby, what is the best way to convert alphanumeric entries to integers for a column of a CSV containing a huge number of rows?

My CSV contains about 60 million rows. The 10th column contains some alphanumeric entries, some of which repeat, that I want to convert into integers with a one-to-one mapping. That is, I don't want the same entry in Original.csv to have multiple corresponding integer values in Processed.csv. So, initially, I wrote the following code:
require 'csv'

udids = []

CSV.open('Processed.csv', "wb") do |csv|
  CSV.foreach('Original.csv', :headers => true) do |row|
    unless udids.include?(row[9])
      udids << row[9]
    end
    udid = udids.index(row[9]) + 1
    csv << [udid]
  end
end
But the program was taking a long time, which I soon realized was because it had to check all the previous rows to make sure only new values get assigned a new integer, and existing ones don't get assigned a new one.
So I thought of hashing them, because when exploring the web about this issue I learned that hashing is faster than sequential comparison (I haven't read the details of how, but anyway...). So I wrote the following code to hash them:
arrayUDID = []
arrayUser = []

f = File.open("Original.csv", "r")
f.each_line { |line|
  row = line.split(",")
  arrayUDID << row[9]
  arrayUser << row[9]
}

arrayUser = arrayUser.uniq

arrayHash = []
for i in 0..arrayUser.size - 1
  arrayHash << arrayUser[i]
  arrayHash << i
end

hash = Hash[arrayHash.each_slice(2).to_a]
array1 = hash.values_at(*arrayUDID)

logfile = File.new("Processed.csv", "w")
for i in 0..array1.size - 1
  logfile.print("#{array1[i]}\n")
end
logfile.close
But here again I observed that the program was taking a lot of time, which I realized must be due to the hash (or hash table) running out of memory.
So, can you suggest a method that will work for my huge file in a reasonable amount of time? By reasonable I mean within 10 hours, since I realize it will take at least a few hours; it took about 5 hours just to extract this dataset from an even bigger one. With the code above it hadn't finished even after 2 days of running. So, if you can suggest a method which can do the task by leaving the computer on overnight, that would be great. Thanks.
I think this should work:
require 'csv'

udids = {}
unique_count = 1
output_csv = CSV.open("Processed.csv", "w")

CSV.foreach("Original.csv").with_index do |row, i|
  output_csv << row and next if i == 0 # skip first row (header info)

  val = row[9]
  if udids[val.to_sym]
    row[9] = udids[val.to_sym]
  else
    udids[val.to_sym] = unique_count
    row[9] = unique_count
    unique_count += 1
  end

  output_csv << row
end

output_csv.close
The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.
I was able to process a 10 million line test CSV file in about 3 minutes.
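For intuition about why the hash lookup matters here, a rough benchmark sketch (the key count and number of lookups are made up):
require 'benchmark'
require 'securerandom'

keys = Array.new(100_000) { SecureRandom.hex(8) }

as_hash = {}
keys.each_with_index { |k, i| as_hash[k] = i } # value is just the position

Benchmark.bm(15) do |x|
  x.report("Array#include?") { 1_000.times { keys.include?(keys.sample) } }
  x.report("Hash#key?")      { 1_000.times { as_hash.key?(keys.sample) } }
end
The array version scans element by element (O(n) per lookup), while the hash lookup is effectively constant time, which is the difference between the two approaches above.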

Ruby future buffer?

I want to get a bunch a XML and parse them. They are somewhat large.
So I was thinking I could get and parse them in a future, like this (I currently use Celluloid):
country_xml = {}
country_pool = GetAndParseXML.pool size: 4, args: [@connection]

countries.each do |country|
  country_xml[country] = country_pool.future.fetch_xml country
end

countries.each do |country|
  xml = country_xml[country]
  # Do stuff with the XML!
end
This would be fine except that it takes up a lot of memory before the XML is actually needed.
Ideally I want it to buffer up, say, 3 XML files, then stop and wait until at least one has been processed before continuing. How would I do that?
The first question is: what is it that's taking up the memory? I will assume it's the parsed XML documents, as that seems most likely to me.
I think the easiest way would be to create an actor that will fetch and process the XML. If you then create a pool of 3 of these actors you will have at most 3 requests being processed at once.
In vague terms (assuming that you aren't using the Celluloid registry):
class DoStuffWithCountryXml
  include Celluloid
  exclusive :do_stuff_with_country

  def initialize(fetcher)
    @fetcher = fetcher
  end

  def do_stuff_with_country(country)
    country_xml = @fetcher.fetch_xml country
    # Do stuff with country_xml
  end
end

country_pool = GetAndParseXML.pool size: 4, args: [@connection]
country_process_pool = DoStuffWithCountryXml.pool size: 3, args: [country_pool]

countries_futures = countries.map { |c| country_process_pool.future.do_stuff_with_country(c) }
countries_stuff   = countries_futures.map { |f| f.value }
Note that if this is the only place where GetAndParseXML is used, then its pool size might as well be the same as the DoStuffWithCountryXml pool size.
I would not use a Pool at all. You're not benefiting from it. A lot of people seem to feel using a Future and a Pool together is a good idea, but it's usually worse than using one or the other.
In your case, use Future ... but you will also benefit from the upcoming Multiplexer features. Until then, do this... use a totally different strategy than has been tried or suggested:
class HandleXML
  include Celluloid

  def initialize(fetcher)
    @fetcher = fetcher
  end

  def get_xml(country)
    @fetcher.fetch_xml(country)
  end

  def process_xml(country, xml)
    #de Do whatever you need to do with the data.
  end
end

def begin_processor(handler, countries, index)
  #de Returns a future; the caller advances country_index itself, since
  #de incrementing the integer argument here would not affect the caller.
  handler.future.get_xml(countries[index])
end

limiter = 3 #de This sets your desired limit.
country_index = 0
data_index = 0
data = {}
processing = []
handler = HandleXML.new(@connection)

#de Load up your initial futures.
[limiter, countries.length].min.times {
  processing << begin_processor(handler, countries, country_index)
  country_index += 1
}

while data_index < countries.length
  data[countries[data_index]] = processing.shift.value
  handler.process_xml(countries[data_index], data[countries[data_index]])
  data_index += 1

  #de Once you've taken out one XML set above, load up another.
  if country_index < countries.length
    processing << begin_processor(handler, countries, country_index)
    country_index += 1
  end
end
The above is just an example of how to do it with Future only, handling 3 at a time. I've not run it and it could have errors, but the idea is demonstrated for you.
The code loads up 3 sets of Country XML, then starts processing that XML. Once it has processed one set of XML, it loads up another, until all the country XML is processed.

Ruby Search Array And Replace String

My question is: how can I search through an array and replace the string at the current index of the search without knowing what the string at that index contains?
The code below searches through an AJAX file hosted on the internet. It finds the inventory and goes through each weapon in it, adding the ID to a string (so I can check whether that weapon has been seen before), followed by the number of times it occurs in the inventory. After every weapon in the inventory has been checked, it goes through all of the IDs added to the string and displays them along with that number (the amount of occurrences). This is so I know how many of each weapon I have.
This is an example of what I have:
strList = ""

inventory.each do |inv|
  amount = 1
  exists = false

  ids = strList.split(',')
  ids.each do |ind|
    if (inv['id'] == ind.split('/').first) then
      exists = true
      amount = ind.split('/').first.to_i
      amount += 1
      ind = "#{inv['id']}/#{amount.to_s}" # This doesn't seem to work as expected.
    end
  end

  if (exists == true) then
    ids.push("#{inv['id']}/#{amount.to_s}")
    strList = ids.join(",")
  end
end

strList.split(",").each do |item|
  puts "#{item.split('/').first} (#{item.split('/').last})"
end
Here is an idea of what code I expected (pseudo-code):
inventory = get_inventory()
drawn_inv = ""

loop.inventory do |inv|
  if (inv['id'].occurred_before?)
    inv['id'].count += 1
  end
end loop

loop.inventory do |inv|
  drawn_inv.add(inv['id'] + "/" + inv['id'].count)
end loop

loop.drawn_inv do |inv|
  puts "#{inv}"
end loop
Any help on how to replace that line is appreciated!
EDIT: Sorry for not providing more information about my code. I skipped the less important part at the bottom and showed commented code instead of actual code; I'll add that now.
EDIT #2: I'll update my description of what it does and what I'm expecting as a result.
EDIT #3: Added pseudo-code.
Thanks in advance,
SteTrezla
You want #each_with_index: http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-each_with_index
You may also want to look at #gsub since it takes a block. You may not need to split this string into an array at all. Basically something like strList.gsub(...){ |match| #...your block }
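A minimal sketch of the gsub-with-a-block idea, assuming strList holds comma-separated "id/count" pairs (the sample values below are made up); the block receives each match and returns its replacement, so the count can be bumped in place without splitting into an array:
str_list = "12/1,34/2"
inv_id   = "12" # the weapon id being tallied

str_list = str_list.gsub(/\b#{Regexp.escape(inv_id)}\/\d+/) do |match|
  id, count = match.split('/')
  "#{id}/#{count.to_i + 1}"
end

str_list # => "12/2,34/2"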

Counting words in JSON file with Ruby

What is the best way to count words in a JSON file with Ruby?
The scan method will do the job but uses a lot of memory.
Try the block version of scan:
count = 0
json_string.scan(/\w+/) { count += 1 }
If you don't want to read the whole file into memory at once:
count = 0
File.new("test.json").each_line do |line|
  line.scan(/\w+/) { count += 1 }
end
This assumes of course that your JSON file is formatted (using prettify_json.rb, for instance). It won't do much good if everything is on a single line, obviously.
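If the JSON is all on one line, one option is to read fixed-size chunks instead; a rough sketch, carrying any partial word at a chunk boundary over to the next chunk so it isn't counted twice (the 64 KB chunk size is arbitrary):
count = 0
carry = ""

File.open("test.json") do |f|
  while (chunk = f.read(64 * 1024))
    buffer = carry + chunk
    carry  = buffer[/\w+\z/] || ""                  # trailing partial word, if any
    buffer = buffer[0, buffer.length - carry.length]
    buffer.scan(/\w+/) { count += 1 }
  end
end
count += 1 unless carry.empty?                      # final word, if the file ends in word characters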

Ruby, how should I design a parser?

I'm writing a small parser for Google and I'm not sure of the best way to design it. The main problem is how it will remember the position it stopped at.
During parsing it's going to append new searches to the end of a file and go through the file starting with the first line. I want to do it so that, if for some reason the execution is interrupted, the script knows the last search it completed successfully.
One way is to delete a line from the file after fetching it, but in that case I have to manage the order in which threads access the file, and deleting the first line of a file, AFAIK, can't be done efficiently.
Another way is to write the number of each processed line to a text file and skip the lines whose numbers are in that file. Or maybe I should use some database instead? TIA
There's nothing wrong with using a state file. The only catch will be that you need to ensure you have fully committed your changes to the state file before your program enters a section where it may be interrupted. Typically this is done with an IO#flush call.
For example, here's a simple state-tracking class that works on a line-by-line basis:
class ProgressTracker
  def initialize(filename)
    @filename = filename
    @file = open(@filename)
    @state_filename = File.expand_path(".#{File.basename(@filename)}.position", File.dirname(@filename))

    if (File.exist?(@state_filename))
      @state_file = open(@state_filename, File::RDWR)
      resume!
    else
      @state_file = open(@state_filename, File::RDWR | File::CREAT)
    end
  end

  def each_line
    @file.each_line do |line|
      mark_position!
      yield(line) if (block_given?)
    end
  end

protected
  def mark_position!
    @state_file.rewind
    @state_file.puts(@file.pos)
    @state_file.flush
  end

  def resume!
    if (position = @state_file.readline)
      @file.seek(position.to_i)
    end
  end
end
You use it with an IO-like block call:
test = ProgressTracker.new(__FILE__)
n = 0

test.each_line do |line|
  n += 1
  puts "%3d %s" % [ n, line ]

  if (n == 10)
    raise 'terminate'
  end
end
In this case, the program reads itself and will stop after ten lines due to a simulated error. On the second run it should display the next ten lines, if there are that many, or simply exit if there's no additional data to retrieve.
One caveat is that you need to remove the .position file associated with the input data if you want the file to be reprocessed, or if the file has been reset. It's also not possible to edit the file and remove earlier lines, or it will throw off the offset tracking. As long as you're simply appending data to the file, or restarting it, everything will be fine.
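As a usage note, resetting is just a matter of deleting that state file; a small sketch mirroring the naming scheme ProgressTracker uses above (the input filename here is hypothetical):
input = "searches.txt" # hypothetical input file
state = File.expand_path(".#{File.basename(input)}.position", File.dirname(input))
File.delete(state) if File.exist?(state)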
