Ruby, how should I design a parser? - ruby

I'm writing a small parser for Google and I'm not sure what's the best way to design it. The main problem is the way it will remember the position it stopped at.
During parsing it's going to append new searches to the end of a file and go through the file startig with the first line. Now I want to do it so, that if for some reason the execution is interrupted, the script knows the last search it has accomplished successfully.
One way is to delete a line in a file after fetching it, but in this case I have to handle order that threads access file and deleting first line in a file afaik can't be done processor-effectively.
Another way is to write the number of used line to a text file and skip the lines whose numbers are in that file. Or maybe I should use some database instead? TIA

There's nothing wrong with using a state file. The only catch will be that you need to ensure you have fully committed your changes to the state file before your program enters a section where it may be interrupted. Typically this is done with an IO#flush call.
For example, here's a simple state-tracking class that works on a line-by-line basis:
class ProgressTracker
def initialize(filename)
#filename = filename
#file = open(#filename)
#state_filename = File.expand_path(".#{File.basename(#filename)}.position", File.dirname(#filename))
if (File.exist?(#state_filename))
#state_file = open(#state_filename, File::RDWR)
resume!
else
#state_file = open(#state_filename, File::RDWR | File::CREAT)
end
end
def each_line
#file.each_line do |line|
mark_position!
yield(line) if (block_given?)
end
end
protected
def mark_position!
#state_file.rewind
#state_file.puts(#file.pos)
#state_file.flush
end
def resume!
if (position = #state_file.readline)
#file.seek(position.to_i)
end
end
end
You use it with an IO-like block call:
test = ProgressTracker.new(__FILE__)
n = 0
test.each_line do |line|
n += 1
puts "%3d %s" % [ n, line ]
if (n == 10)
raise 'terminate'
end
end
In this case, the program reads itself and will stop after ten lines due to a simulated error. On the second run it should display the next ten lines, if there are that many, or simply exit if there's no additional data to retrieve.
One caveat is that you need to remove the .position file associated with the input data if you want the file to be reprocessed, or if the file has been reset. It's also not possible to edit the file and remove earlier lines or it will throw off the offset tracking. So long as you're simply appending data to the file, or restarting it, everything will be fine.

Related

ActiveRecord. Is a transaction recommended here. How can I ensure all records are saved

I have the following.
files.each do |file_path|
filename = file_path_to_file_name(file_path)
feature = Feature.new
feature.name = filename_to_name(filename)
feature.filename = filename
feature.suite_id = suite.id
feature.save!
end
I want to make sure every feature gets saved, but if an exception is thrown I don't want to just stop. I don't want to just fail silently and move on as well, which is why I'm using ! with save.
First off: Is this a case where I should be using a transaction (as in the following)?:
found_tests.each do |file_path|
Feature.transaction do
filename = file_path_to_file_name(file_path)
feature = Feature.new
feature.name = filename_to_name(filename)
feature.filename = filename
feature.suite_id = suite.id
feature.save!
end
end
Second, what are good ways to ensure everything gets saved? What risks are there when I'm regularly working on the order of less than 10,000 things being saved each time this particular script runs.
If you have a validation, whatever you use save or save!, data won't persist. You would usually use .transaction when you are doing multiple updates, and say the last one fails, so you have to rollback previous updates as well. Your case is not such a case.
There are multiple ways to go around to make sure you've saved everything. Everytime you fail, you could keep a Redis queue or something, so you can investigate why some features didn't save and then attempt to fix and then save them again.
If you want to continue running your script, you'd want to rescue your exception and call next to move on to the next file_path
found_tests.each do |file_path|
begin
filename = file_path_to_file_name(file_path)
feature = Feature.new
feature.name = filename_to_name(filename)
feature.filename = filename
feature.suite_id = suite.id
feature.save!
rescue YourError => e
puts e
next
end
end

Ruby skips a section of script

Ruby noob here, and I am having an infuriating issue.
Something in my script is making Ruby skip a section of it.
#Main_block_begins_here
if __FILE__ == $0
#Open_the_file_with_values_and_weights
File.open(getFileName, 'r') do |f|
#Intial_data...
$totalNumber=f.gets.chomp!.to_i
$totalCapacity=f.gets.chomp!.to_i
$currentItemWeight=0
$currentItemValue=0
$spaceLeft=0
$spaceLeft=$totalCapacity
$takenWeight=0
$takenValue=0
#Make_a_heap_for_the_data
$prioQ = Heap.new do
#reads_values,_computes_them,_and_enqueues_them
until i==$totalNumber
total=0
value=f.gets.chomp!.to_i
weight=f.gets.chomp!.to_i
total=value/weight
puts "#{total}=t, #{value}=v, #{weight}=w. Correct?"
check=gets.chomp!
it=Obj.new(total, value, weight) do
prioQ.enqueue(it)
end
i+=1
end
end
#dequeueIt
puts "total weight taken was #{$takenWeight} and total value taken was #{$takenValue}."
end
end
The commented line #dequeueIt is a method earlier on that when I let run, it gives me an infinite loop with all values that were supposed to be read in from the text file as zeros.
The puts line and the check declaration line inside the until loop are for debugging purposes, and of course they never print out.
Commented out, when I run the program it just prints out the last line as if the until loop never ran. If more code for context is necessary, just let me know and I'll put it up.
My hair's starting to fall out over this one, so any help is appreciated!
EDIT:: Yes, it should have been i==$totalnumber. I fixed it in my code and it still doesn't execute the script inside that loop.
This is one oversight:
until i=$totalNumber
This assigns a number to i; it is always true. Try
until i==$totalNumber

How can a value be accumulated when run in parallel/concurent processes?

I'm running some Ruby scripts concurrently using Grosser/Parallel.
During each concurrent test I want to add up the number of times a particular thing has happened, then display that number.
Let's say:
def main
$this_happened = 0
do_this_in_parallel
puts $this_happened
end
def do_this_in_parallel
Parallel.each(...) {
$this_happened += 1
}
end
The final value after do_this_in_parallel has finished will always be 0
I'd like to know why this happens.
How can I get the desired result which would be $this_happenend > 0?
Thanks.
This doesn't work because separate processes have separate memory spaces: setting variables etc in one process has no effect on what happens in the other process.
However you can return a result from your block (because under the hood parallel sets up pipes so that the processes can be fed input/return results). For example you could do this
counts = Parallel.map(...) do
#the return value of the block should
#be the number of times the event occurred
end
Then just sum the counts to get your total count (eg counts.reduce(:+)). You might also want to read up on map-reduce for more information about this way of parallelising work
I have never used parallel but the documentation seems to suggest that something like this might work.
Parallel.each(..., :finish => lambda {|*_| $this_happened += 1}) { do_work }

refactor this ruby database/sequel gem lookup

I know this code is not optimal, any ideas on how to improve it?
job_and_cost_code_found = false
timberline_db['SELECT Job, Cost_Code FROM [JCM_MASTER__COST_CODE] WHERE [Job] = ? AND [Cost_Code] = ?', job, clean_cost_code].each do |row|
job_and_cost_code_found = true
end
if job_and_cost_code_found == false then
info = linenum + "," + id + ",,Employees default job and cost code do not exist in timberline. job:#{job} cost code:#{clean_cost_code}"
add_to_exception_output_file(info)
end
You're breaking a lot of simple rules here.
Don't select what you don't use.
You select a number of columns, then completely ignore the result data. What you probably want is a count:
SELECT COUNT(*) AS cost_code_count FROM [JCM_MASTER__COST_CODE] WHERE [Job] = ? AND [Cost_Code] = ?'
Then you'll get one row that will have either a zero or non-zero value in it. Save this into a variable like:
job_and_cost_codes_found = timberline_db[...][0]['cost_code_count']
Don't compare against false unless you need to differentiate between that and nil
In Ruby only two things evaluate as false, nil and false. Most of the time you will not be concerned about the difference. On rare occasions you might want to have different logic for set true, set false or not set (nil), and only then would you test so specifically.
However, keep in mind that 0 is not a false value, so you will need to compare against that.
Taking into account the previous optimization, your if could be:
if job_and_cost_codes_found == 0
# ...
end
Don't use then or other bits of redundant syntax
Most Ruby style-guides spurn useless syntax like then, just as they recommend avoiding for and instead use the Enumerable class which is far more flexible.
Manipulate data, not strings
You're assembling some kind of CSV-like line in the end there. Ideally you'd be using the built-in CSV library to do the correct encoding, and libraries like that want data, not a string they'd have to parse.
One step closer to that is this:
line = [
linenum,
id,
nil,
"Employees default job and cost code do not exist in timberline. job:#{job} cost code:#{clean_cost_code}"
].join(',')
add_to_exception_output_file(line)
You'd presumably replace join(',') with the proper CSV encoding method that applies here. The library is more efficient when you can compile all of the data ahead of time into an array-of-arrays, so I'd recommend doing that if this is the end goal.
For example:
lines = [ ]
# ...
if (...)
# Append an array to the lines to write to the CSV file.
lines << [ ... ]
end
Keep your data in a standard structure like an Array, a Hash, or a custom object, until you're prepared to commit it to its final formatted or encoded form. That way you can perform additional operations on it if you need to do things like filtering.
It's hard to refactor this when I'm not exactly sure what it's supposed to be doing, but assuming that you want to log an error when there's no entry matching a job & code pair, here's what I've come up with:
def fetch_by_job_and_cost_code(job, cost_code)
timberline_db['SELECT Job, Cost_Code FROM [JCM_MASTER__COST_CODE] WHERE [Job] = ? AND [Cost_Code] = ?', job, cost_code]
end
if fetch_by_job_and_cost_code(job, clean_cost_code).none?
add_to_exception_output_file "#{linenum},#{id},,Employees default job and cost code do not exist in timberline. job:#{job} cost code:#{clean_cost_code}"
end

Ruby/Builder API: create XML without using blocks

I'd like to use Builder to construct a set of XML files based on a table of ActiveRecord models. I have nearly a million rows, so I need to use find_each(batch_size: 5000) to iterate over the records and write an XML file for each batch of them, until the records are exhausted. Something like the following:
filecount = 1
count = 0
xml = ""
Person.find_each(batch_size: 5000) do |person|
xml += person.to_xml # pretend .to_xml() exists
count += 1
if count == MAX_PER_FILE
File.open("#{filecount}.xml", 'w') {|f| f.write(xml) }
xml = ""
filecount += 1
count = 0
end
end
This doesn't work well with Builder's interface, as it wants to work in blocks, like so:
xml = builder.person { |p| p.name("Jim") }
Once the block ends, Builder closes its current stanza; you can't keep a reference to p and use it outside of the block (I tried). Basically, Builder wants to "own" the iteration.
So to make this work with builder, I'd have to do something like:
filecount = 0
offset = 0
while offset < Person.count do
count = 0
builder = Builder::XmlMarkup.new(indent: 5)
xml = builder.people do |people|
Person.limit(MAX_PER_FILE).offset(offset).each do |person|
people.person { |p| p.name(person.name) }
count += 1
end
end
File.open("#output#file_count.xml", 'w') {|f| f.write(xml) }
filecount += 1
offset += count
end
Is there a way to use Builder without the block syntax? Is there a way to programmatically tell it "close the current stanza" rather than relying on a block?
My suggestion: don't use builder.
XML is a simple format as long as you escape the xml entities correctly.
Batch your db retrieve then just write out the batch as xml to a file handle. Don't buffer via a string as your example shows. Just write to the filehandle. Let the OS deal with buffering. Files can be of any size, why the limit?
Also, don't include the indentation spaces, with million rows, they'd add up.
Added
When writing xml files, I also include xml comments at the top of the file:
The name of the software and version that generated the xml file
Date / timestamp the file was written
Other useful info. Eg in this case you could say that the file is batch # x of the original data set.
I ended up generating the XML manually, as per Larry K's suggestion. Ruby's built-in XML encoding made this a piece of cake. I'm not sure why this feature not more widely advertised... I wasted a lot of time Googling and trying various to_xs implementations before I stumbled upon the built-in "foo".encode(xml: :text).
My code now looks like:
def run
count = 0
Person.find_each(batch_size: 5000) do |person|
open_new_file if #current_file.nil?
# simplified- I actually have many more fields and elements
#
#current_file.puts " <person>#{person.name.encode(xml: :text)}</person>"
count += 1
if count == MAX_PER_FILE
close_current_file
count = 0
end
end
close_current_file
end
def open_new_file
#file_count += 1
#current_file = File.open("people#{#file_count}.xml", 'w')
#current_file.puts "<?xml version='1.0' encoding='UTF-8'?>"
#current_file.puts " <people>"
end
def close_current_file
unless #current_file.nil?
#current_file.puts " </people>"
#current_file.close
#current_file = nil
end
end

Resources