Ruby parameterize if ... then blocks - ruby

I am parsing a text file and want to be able to extend the sets of tokens that can be recognized easily. Currently I have the following:
if line =~ /!DOCTYPE/
puts "token doctype " + line[0,20]
#ast[:doctype] << line
elsif line =~ /<html/
puts "token main HTML start " + line[0,20]
html_scanner_off = false
elsif line =~ /<head/ and not html_scanner_off
puts "token HTML header starts " + line[0,20]
html_header_scanner_on = true
elsif line =~ /<title/
puts "token HTML title " + line[0,20]
#ast[:HTML_header_title] << line
end
Is there a way to write this with a yield block, e.g. something like:
scanLine("title", :HTML_header_title, line)
?

Don't parse HTML with regexes.
That aside, there are several ways to do what you're talking about. One:
class Parser
  class Token
    attr_reader :name, :pattern, :block
    def initialize(name, pattern, block)
      @name = name
      @pattern = pattern
      @block = block
    end
    def process(line)
      @block.call(self, line)
    end
  end
  def initialize
    @tokens = []
  end
  def scanLine(line)
    # Use the first registered token whose pattern matches; skip lines that match nothing
    token = @tokens.find {|t| line =~ t.pattern}
    token.process(line) if token
  end
  def addToken(name, pattern, &block)
    @tokens << Token.new(name, pattern, block)
  end
end
p = Parser.new
p.addToken("title", /<title/) {|token, line| puts "token #{token.name}: #{line}"}
p.scanLine('<title>This is the title</title>')
This has some limitations (like not checking for duplicate tokens), but works:
$ ruby parser.rb
token title: <title>This is the title</title>
$

If you're intending to parse HTML content, you might want to use one of the HTML parsers like nokogiri (http://nokogiri.org/) or Hpricot (http://hpricot.com/) which are really high-quality. A roll-your-own approach will probably take longer to perfect than figuring out how to use one of these parsers.
On the other hand, if you're dealing with something that's not quite HTML and can't be parsed that way, then you'll need to roll your own somehow. There are a few Ruby parser frameworks out there that may help, but for simple tasks where performance isn't a critical factor, you can get by with a pile of regexps like you have here.
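The "pile of regexps" can also be kept as data, so that recognizing a new token means adding one row instead of another elsif branch. A minimal sketch, assuming only the patterns from the question: TOKEN_RULES, scan_line, scan_lines, and the token names other than :doctype and :HTML_header_title are hypothetical, and the html_scanner_off state checks from the original are left out.

```ruby
# Hypothetical rule table: each row pairs a pattern with the token name to record.
TOKEN_RULES = [
  [/!DOCTYPE/, :doctype],
  [/<html/,    :html_start],
  [/<head/,    :html_header_start],
  [/<title/,   :HTML_header_title]
].freeze

# Returns the name of the first rule whose pattern matches, or nil if none does.
def scan_line(line, rules = TOKEN_RULES)
  _pattern, name = rules.find { |pattern, _name| line =~ pattern }
  name
end

# Groups lines by the token they matched; lines that match no rule are skipped.
def scan_lines(lines, rules = TOKEN_RULES)
  ast = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    name = scan_line(line, rules)
    ast[name] << line if name
  end
  ast
end
```

Rule order matters here just as it does in the if/elsif chain: the first matching row wins.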

Related

Compare REXML elements for name/attribute equality in RSpec

Is there a matcher for comparing REXML elements for logical equality in RSpec? I tried writing a custom matcher that converts them to formatted strings, but it fails if the attribute order is different. (As noted in the XML spec, the order of attributes should not be significant.)
I could grind through writing a custom matcher that compares the name, namespace, child nodes, attributes, etc., etc., but this seems time-consuming and error-prone, and if someone else has already done it I'd rather not reinvent the wheel.
I ended up using the equivalent-xml gem and writing an RSpec custom matcher to convert the REXML to Nokogiri, compare with equivalent-xml, and pretty-print the result if needed.
The test assertion is pretty simple:
expect(actual).to be_xml(expected)
or
expect(actual).to be_xml(expected, path)
if you want to display the file path or some sort of identifier (e.g. if you're comparing a lot of documents).
The match code is a little fancier than it needs to be because it handles REXML, Nokogiri, and strings.
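require 'nokogiri'
require 'equivalent-xml'
require 'diffy'
require 'rexml/document'
require 'stringio'
module XMLMatchUtils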
def self.to_nokogiri(xml)
return nil unless xml
case xml
when Nokogiri::XML::Element
xml
when Nokogiri::XML::Document
xml.root
when String
to_nokogiri(Nokogiri::XML(xml, &:noblanks))
when REXML::Element
to_nokogiri(xml.to_s)
else
raise "be_xml() expected XML, got #{xml.class}"
end
end
def self.to_pretty(nokogiri)
return nil unless nokogiri
out = StringIO.new
save_options = Nokogiri::XML::Node::SaveOptions::FORMAT | Nokogiri::XML::Node::SaveOptions::NO_DECLARATION
nokogiri.write_xml_to(out, encoding: 'UTF-8', indent: 2, save_with: save_options)
out.string
end
def self.equivalent?(expected, actual, filename = nil)
expected_xml = to_nokogiri(expected) || raise("expected value #{expected || 'nil'} does not appear to be XML#{" in #{filename}" if filename}")
actual_xml = to_nokogiri(actual)
EquivalentXml.equivalent?(expected_xml, actual_xml, element_order: false, normalize_whitespace: true)
end
def self.failure_message(expected, actual, filename = nil)
expected_string = to_pretty(to_nokogiri(expected))
actual_string = to_pretty(to_nokogiri(actual)) || actual
# Uncomment this to dump expected/actual to file for manual diffing
#
# now = Time.now.to_i
# FileUtils.mkdir('tmp') unless File.directory?('tmp')
# File.open("tmp/#{now}-expected.xml", 'w') { |f| f.write(expected_string) }
# File.open("tmp/#{now}-actual.xml", 'w') { |f| f.write(actual_string) }
diff = Diffy::Diff.new(expected_string, actual_string).to_s(:text)
"expected XML differs from actual#{" in #{filename}" if filename}:\n#{diff}"
end
def self.to_xml_string(actual)
to_pretty(to_nokogiri(actual))
end
def self.failure_message_when_negated(actual, filename = nil)
"expected not to get XML#{" in #{filename}" if filename}:\n\t#{to_xml_string(actual) || 'nil'}"
end
end
The actual matcher is fairly straightforward:
RSpec::Matchers.define :be_xml do |expected, filename = nil|
match do |actual|
XMLMatchUtils.equivalent?(expected, actual, filename)
end
failure_message do |actual|
XMLMatchUtils.failure_message(expected, actual, filename)
end
failure_message_when_negated do |actual|
XMLMatchUtils.failure_message_when_negated(actual, filename)
end
end
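If pulling in extra gems is not an option, the hand-rolled comparison mentioned in the question could start out roughly like this. It is only a sketch under stated assumptions: attribute_hash and xml_equal? are hypothetical helper names, only REXML is used, and namespace URIs, comments, and mixed content are ignored, which is much of what equivalent-xml handles for you.

```ruby
require 'rexml/document'

# Collects an element's attributes into a plain Hash, so that comparing two
# such hashes does not depend on attribute order.
def attribute_hash(element)
  attributes = {}
  element.attributes.each_attribute { |attr| attributes[attr.expanded_name] = attr.value }
  attributes
end

# Recursively compares name, attributes, leading text, and child elements.
# Child element order is still significant in this sketch.
def xml_equal?(a, b)
  return false unless a.expanded_name == b.expanded_name
  return false unless attribute_hash(a) == attribute_hash(b)
  return false unless a.text.to_s.strip == b.text.to_s.strip

  kids_a = a.elements.to_a
  kids_b = b.elements.to_a
  kids_a.size == kids_b.size && kids_a.zip(kids_b).all? { |x, y| xml_equal?(x, y) }
end
```

A custom matcher could call xml_equal?(expected, actual) in its match block, at the cost of a less helpful failure message than the diff-based one above.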

Ruby create a class from a specification file

I have a specification file Spec.txt like this
title :Test
attribute :fieldOne, String
attribute :fieldTwo, Fixnum
constraint :fieldOne, 'fieldOne != nil'
constraint :fieldTwo, 'fieldTwo >= 0'
from which I need to dynamically create a class with classname Test and the attributes fieldOne and fieldTwo and the constraints of the attributes.
So far I read in the file, split up the lines, store them in arrays, and then dynamically create the class with
dynamic_name = @@TITLE
Object.const_set(dynamic_name, Class.new {
def init *args
...
end
})
But I am not sure if this is the right way to go or even how to create the attributes and the constraints now?
One approach might be:
attrs = []
constraints = []
new_class = ""
File.foreach('Spec.txt') do |line|
  if line =~ /title/
    # "title :Test" -> "Test"
    new_class = line.split[1].tr(':,', '')
  elsif line =~ /attribute/
    # "attribute :fieldOne, String" -> ":fieldOne," (the comma is reused when joining)
    attrs << line.split[1]
  elsif line =~ /constraint/
    # "constraint :fieldOne, 'fieldOne != nil'" -> field "fieldOne", operator "!="
    field = line.split[2].tr("'", '')
    constraint = line.split[3]
    constraints << "\n def #{field}=\n validation here (#{constraint}) \n end\n"
  end
end
# Join the symbols and drop the comma left over after the last one
all_attrs = attrs.join(" ").chomp(",")
all_constraints = constraints.join
result =
  "class " + new_class + "\n" +
  "attr_reader " +
  "#{all_attrs}\n" +
  "#{all_constraints}\n" +
  "end\n"
print result
run:
$ ruby create_class.rb
class Test
attr_reader :fieldOne, :fieldTwo
def fieldOne=
validation here (!=)
end
def fieldTwo=
validation here (>=)
end
end
$
Needs some more work on the validations but you get the idea.
Once the placeholder validations have been replaced with real Ruby, you could write the output to a Ruby file and load it straight away, e.g.
# You would add this after the first section of code, after the 'printf "#{result}"'
File.open("#{new_class}.rb", "w") do |file|
file.write(result)
end
require_relative "#{new_class}.rb"
test_it= Object.const_get(new_class).new
puts "#{test_it}"
Otherwise if creating the ruby file is enough:
ruby create_class.rb > class.rb
As Vaughan suggested.
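An alternative to generating source text is to build the class object directly, along the lines of the Class.new call in the question. The following is a sketch, not a finished design: build_class, the Generated namespace, the keyword-argument constructor, and the valid? method are all hypothetical, and each constraint string is evaluated as Ruby, so it assumes the spec file is trusted. The attribute types (String, Fixnum) are parsed past but not enforced.

```ruby
# Hypothetical namespace so generated classes do not clobber existing constants.
module Generated; end

def build_class(spec_text)
  title = nil
  attrs = []
  constraints = []

  spec_text.each_line do |line|
    case line.strip
    when /\Atitle\s+:(\w+)/ then title = $1
    when /\Aattribute\s+:(\w+)/ then attrs << $1.to_sym
    when /\Aconstraint\s+:\w+,\s*'(.*)'/ then constraints << $1
    end
  end

  klass = Class.new do
    attr_accessor(*attrs)

    # Accepts attribute values as keyword arguments, e.g. new(fieldOne: "a").
    define_method(:initialize) do |**values|
      values.each { |name, value| public_send("#{name}=", value) }
    end

    # Assumption: trusted input. Each constraint string runs as Ruby code
    # with the instance as self, so "fieldTwo >= 0" calls the fieldTwo reader.
    define_method(:valid?) do
      constraints.all? { |expression| instance_eval(expression) }
    end
  end

  # Registers the class under its title and returns it.
  Generated.const_set(title, klass)
end
```

Calling build_class(File.read('Spec.txt')) with the spec from the question would define Generated::Test, after which Generated::Test.new(fieldOne: "a", fieldTwo: 1).valid? checks both constraints.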

How do I test reading a file?

I'm writing a test for one of my classes which has the following constructor:
def initialize(filepath)
  @transactions = []
  File.open(filepath).each do |line|
    next if $. == 1
    elements = line.split(/\t/).map { |e| e.strip }
    transaction = Transaction.new(elements[0], Integer(1))
    @transactions << transaction
  end
end
I'd like to test this by using a fake file, not a fixture. So I wrote the following spec:
it "should read a file and create transactions" do
filepath = "path/to/file"
mock_file = double(File)
expect(File).to receive(:open).with(filepath).and_return(mock_file)
expect(mock_file).to receive(:each).with(no_args()).and_yield("phrase\tvalue\n").and_yield("yo\t2\n")
filereader = FileReader.new(filepath)
filereader.transactions.should_not be_nil
end
Unfortunately this fails because I'm relying on $. to equal 1 and increment on every line and for some reason that doesn't happen during the test. How can I ensure that it does?
Global variables make code hard to test. You could use each_with_index:
File.open(filepath) do |file|
file.each_with_index do |line, index|
next if index == 0 # zero based
# ...
end
end
But it looks like you're parsing a CSV file with a header line. Therefore I'd use Ruby's CSV library:
require 'csv'
CSV.foreach(filepath, col_sep: "\t", headers: true, converters: :numeric) do |row|
  @transactions << Transaction.new(row['phrase'], row['value'])
end
You can (and should) use IO#each_line together with Enumerable#each_with_index which will look like:
File.open(filepath).each_line.each_with_index do |line, i|
  next if i == 0 # each_with_index is zero-based, so the header line is index 0
# …
end
Or you can drop the first line, and work with others:
File.open(filepath).each_line.drop(1).each do |line|
# …
end
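Both header-skipping variants work on any IO-like object, which also makes them easy to try out without mocking File. A small sketch, assuming tab-separated input as in the question; rows_by_index and rows_by_drop are hypothetical helper names, and StringIO stands in for the opened file.

```ruby
require 'stringio'

# Returns every line except the header, using a zero-based index check.
def rows_by_index(io)
  rows = []
  io.each_line.each_with_index do |line, index|
    next if index.zero? # index 0 is the header line
    rows << line.chomp.split("\t")
  end
  rows
end

# Same result, dropping the first line up front.
# Note that drop(1) reads the remaining lines into an array first.
def rows_by_drop(io)
  io.each_line.drop(1).map { |line| line.chomp.split("\t") }
end
```

Passing StringIO.new("phrase\tvalue\nyo\t2\n") to either helper skips the header without relying on $., so a reader class that accepts an IO object can be tested the same way.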
If you don't want to mess around with mocking File for each test, you can try FakeFS, which implements an in-memory file system based on StringIO and cleans up automatically after your tests.
This way your tests don't need to change if your implementation changes.
require 'fakefs/spec_helpers'
describe "FileReader" do
include FakeFS::SpecHelpers
def stub_file file, content
FileUtils.mkdir_p File.dirname(file)
File.open( file, 'w' ){|f| f.write( content ); }
end
it "should read a file and create transactions" do
file_path = "path/to/file"
stub_file file_path, "phrase\tvalue\nyo\t2\n"
filereader = FileReader.new(file_path)
expect( filereader.transactions ).to_not be_nil
end
end
Be warned: this is an implementation of most of the file access in Ruby, passing it back onto the original method where possible. If you are doing anything advanced with files you may start running into bugs in the FakeFS implementation. I got stuck with some binary file byte read/write operations which weren't implemented in FakeFS quite how Ruby implemented them.

Parse REXML Document, ignoring whitespace

Should REXML ignore indentation and whitespace?
I am debugging an issue with a simple HTML to Markdown converter. For some reason it fails on
<blockquote><p>foo</p></blockquote>
But not on
<blockquote>
<p>foo</p>
</blockquote>
The reason is that in the first case type.children.first.value is not set, whereas in the latter case it is.
The original code can be found at the link above, but a condensed snippet that shows the problem is below:
require 'rexml/document'
include REXML
def parse_string(string)
doc = Document.new("<root>\n"+string+"\n</root>")
root = doc.root
root.elements.each do |element|
parse_element(element, :root)
end
end
def parse_element(element, parent)
@output = ''
# ...
@output << opening(element, parent)
#...
end
def opening(type, parent)
case type.name.to_sym
#...
when :blockquote
# remove leading newline
type.children.first.value = ""
"> "
end
end
#Parses just fine
puts parse_string("<blockquote>\n<p>foo</p>\n</blockquote>")
# Fails with undefined method `value=' for <p> ... </>:REXML::Element (NoMethodError)
puts parse_string("<blockquote><p>foo</p></blockquote>")
I am quite certain this is due to some parameter that makes REXML require whitespace and indentation: why else would it parse the first XML differently from the latter?
Can I force REXML to parse both the same? Or am I looking at a whole different kind of bug?
Try passing the option :ignore_whitespace_nodes=>:all to Document.new().
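To see what the option changes, the sketch below compares the direct children of the blockquote element for both inputs. child_classes, SPACED, and COMPACT are hypothetical names introduced for illustration; the context hash is passed as the second argument to Document.new.

```ruby
require 'rexml/document'

SPACED  = "<blockquote>\n<p>foo</p>\n</blockquote>"
COMPACT = "<blockquote><p>foo</p></blockquote>"

# Returns the classes of the root element's direct children.
def child_classes(xml, context = {})
  REXML::Document.new(xml, context).root.children.map(&:class)
end

default_spaced  = child_classes(SPACED)   # whitespace-only text nodes are kept
default_compact = child_classes(COMPACT)  # only the <p> element
ignored_spaced  = child_classes(SPACED, { ignore_whitespace_nodes: :all })
```

By default the indented input starts with a whitespace-only text node, which is why value= works there and fails on the compact input, whose first child is the p element. With the option set, both inputs yield the same tree, so the line that blanks out the leading newline is no longer needed and would have to be removed or guarded.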

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
def start_element(element, attributes)
if element == 'title'
puts "found title"
end
end
def characters(text)
puts text
end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text inside the title tag?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
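The same flag-tracking idea can be tried without Nokogiri installed. The sketch below swaps in REXML's stream parser, so it is not the Nokogiri API from the question: tag_start, tag_end, and text are the REXML::StreamListener counterparts of start_element, end_element, and characters. TitleCollector and collect_titles are hypothetical names. It collects the text instead of printing it, and appends chunks because a parser may deliver one element's text in several calls.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text content of every <title> element seen in the stream.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, _attributes)
    return unless name == 'title'
    @in_title = true
    @titles << String.new # start a fresh, mutable buffer for this title
  end

  def tag_end(name)
    @in_title = false if name == 'title'
  end

  def text(chunk)
    @titles.last << chunk if @in_title
  end
end

def collect_titles(xml)
  listener = TitleCollector.new
  REXML::Document.parse_stream(xml, listener)
  listener.titles
end
```

For the sample document in the question, collect_titles returns the title text only, and the site text is ignored because the flag is false while it is being read.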
The accepted answer above is correct; however, it has the drawback that it goes through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It stops as soon as it finds the first occurrence of the <title> tag.
