How to define a fixed-width constraint in parslet - ruby

I am looking into parslet to write a lot of data import code. Overall, the library looks good, but I'm struggling with one thing. A lot of our input files are fixed width, and the widths differ between formats even when the field itself doesn't. For example, we might get one file with a 9-character currency field and another with an 11-character one. Does anyone know how to define a fixed-width constraint on a parslet atom?
Ideally, I would like to define an atom that understands currency (with optional dollar signs, thousands separators, etc.) and then, on the fly, create a new atom based on it that is exactly equivalent except that it parses exactly N characters.
Does such a combinator exist in parslet? If not, how difficult would it be to write one myself?

What about something like this...
class MyParser < Parslet::Parser
  def initialize(widths = {})
    @widths = widths
    super
  end
  rule(:currency) {...}
  rule(:fixed_c) {currency.fixed(@widths[:currency])}
  rule(:fixed_str) {str("bob").fixed(4)}
end
puts MyParser.new.fixed_str.parse("bob").inspect
This will fail with:
"Expected 'bob' to be 4 long at line 1 char 1"
Here's how you do it:
require 'parslet'
class Parslet::Atoms::FixedLength < Parslet::Atoms::Base
  attr_reader :len, :parslet
  def initialize(parslet, len, tag=:length)
    super()
    raise ArgumentError,
      "Asking for zero length of a parslet. (#{parslet.inspect} length #{len})" \
      if len == 0
    @parslet = parslet
    @len = len
    @tag = tag
    @error_msgs = {
      :lenrep => "Expected #{parslet.inspect} to be #{len} long",
      :unconsumed => "Extra input after last repetition"
    }
  end
  def try(source, context, consume_all)
    start_pos = source.pos
    success, value = parslet.apply(source, context, false)
    return succ(value) if success && value.str.length == @len
    context.err_at(
      self,
      source,
      @error_msgs[:lenrep],
      start_pos,
      [value])
  end
  precedence REPETITION
  def to_s_inner(prec)
    parslet.to_s(prec) + "{len:#{@len}}"
  end
end
module Parslet::Atoms::DSL
  def fixed(len)
    Parslet::Atoms::FixedLength.new(self, len)
  end
end

Methods in parser classes are basically generators for parslet atoms. The simplest form these methods come in are 'rule's, methods that just return the same atoms every time they are called. It is just as easy to create your own generators that are not such simple beasts. Please look at http://kschiess.github.com/parslet/tricks.html for an illustration of this trick (Matching strings case insensitive).
It seems to me that your currency parser is a parser with only a few parameters and that you could probably create a method (def ... end) that returns currency parsers tailored to your liking. Maybe even use initialize and constructor arguments? (ie: MoneyParser.new(4,5))
For more help, please address your questions to the mailing list. Such questions are often easier to answer if you illustrate it with code.

Maybe my partial solution will help to clarify what I meant in the question.
Let's say you have a somewhat non-trivial parser:
class MyParser < Parslet::Parser
  rule(:dollars) {
    match('[0-9]').repeat(1).as(:dollars)
  }
  rule(:comma_separated_dollars) {
    match('[0-9]').repeat(1, 3).as(:dollars) >> ( match(',') >> match('[0-9]').repeat(3, 3).as(:dollars) ).repeat(1)
  }
  rule(:cents) {
    match('[0-9]').repeat(2, 2).as(:cents)
  }
  rule(:currency) {
    (str('$') >> (comma_separated_dollars | dollars) >> str('.') >> cents).as(:currency)
    # order is important in (comma_separated_dollars | dollars)
  }
end
Now suppose we want to parse a fixed-width currency string; this isn't the easiest thing to do. Of course, you could figure out exactly how to express the repeat expressions in terms of the final width, but that gets unnecessarily tricky, especially in the comma-separated case. Also, in my use case, currency is really just one example. I want an easy way to come up with fixed-width definitions for addresses, zip codes, etc.
This seems like something that should be handle-able by a PEG. I managed to write a prototype version, using Lookahead as a template:
class FixedWidth < Parslet::Atoms::Base
  attr_reader :bound_parslet
  attr_reader :width
  def initialize(width, bound_parslet) # :nodoc:
    super()
    @width = width
    @bound_parslet = bound_parslet
    @error_msgs = {
      :premature => "Premature end of input (expected #{width} characters)",
      :failed => "Failed fixed width",
    }
  end
  def try(source, context) # :nodoc:
    pos = source.pos
    teststring = source.read(width).to_s
    if (not teststring) || teststring.size != width
      return error(source, @error_msgs[:premature]) #if not teststring && teststring.size == width
    end
    fakesource = Parslet::Source.new(teststring)
    value = bound_parslet.apply(fakesource, context)
    return value if not value.error?
    source.pos = pos
    return error(source, @error_msgs[:failed])
  end
  def to_s_inner(prec) # :nodoc:
    "FIXED-WIDTH(#{width}, #{bound_parslet.to_s(prec)})"
  end
  def error_tree # :nodoc:
    Parslet::ErrorTree.new(self, bound_parslet.error_tree)
  end
end
# now we can easily define a fixed-width currency rule:
class SHPParser
  rule(:currency15) {
    FixedWidth.new(15, currency >> str(' ').repeat)
  }
end
Of course, this is a pretty hacked solution. Among other things, line numbers and error messages are not good inside of a fixed width constraint. I would love to see this idea implemented in a better fashion.
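Outside of parslet, the core idea of the prototype (first take exactly N characters, then require the inner grammar to match that whole slice) can be sketched in plain Ruby. This is only an illustration under assumptions: the CURRENCY and ZIP regexes, the parse_fixed helper and the [name, width, pattern] layout format are all hypothetical and not part of parslet.

```ruby
# Hypothetical patterns standing in for the parslet rules above.
CURRENCY = /\$(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}/
ZIP      = /\d{5}/

# Hypothetical helper: walks a layout of [name, width, pattern] triples,
# slices exactly `width` characters per field, and requires the pattern
# (plus optional trailing spaces) to match the whole slice.
def parse_fixed(line, layout)
  offset = 0
  layout.each_with_object({}) do |(name, width, pattern), record|
    field = line[offset, width].to_s
    if field.length != width
      raise ArgumentError, "premature end of input (expected #{width} characters for #{name})"
    end
    match = /\A(#{pattern}) *\z/.match(field)
    raise ArgumentError, "failed fixed width for #{name}: #{field.inspect}" unless match
    record[name] = match[1]
    offset += width
  end
end

# A 15-character currency field followed by a 5-character zip code.
line = "$1,234.56".ljust(15) + "90210"
record = parse_fixed(line, [[:amount, 15, CURRENCY], [:zip, 5, ZIP]])
p record
```

The same CURRENCY pattern can be reused with a different width per file format, which is the property the question asks for; error positions are just as coarse here as in the prototype.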

Related

Running method on two variables at once

I am wondering what the proper way is to refactor this code for efficiency besides running it twice.
class Hamming
def compute (a, b)
a.to_a.split("")
b.to_a.split("")
end
end
Is there something similar to assigning two variables at once like
a, b = 1, 2?
First off, your code is invalid. #to_a returns an array; #split is not defined on arrays.
Secondly, if your code were valid (say, a.to_s.split(""); b.to_s.split("")), it would not actually do much, because your method would just return the value of the last executed statement (b.to_s.split("")). Both #to_s and #split are non-destructive, which means they will not change a or b. The only effect you get from this method is what it returns, and you do not return the result of a.to_s.split("") in any way: it is forgotten.
If you meant something like this:
class Hamming
  def compute(a, b)
    [
      a.to_s.split(""),
      b.to_s.split("")
    ]
  end
end
this is fairly readable. However, if you had more complex operation than just .to_s.split(""), it would be better to isolate it into its own function:
class Hamming
  def compute(a, b)
    [
      list_chars(a),
      list_chars(b)
    ]
  end
  private def list_chars(str)
    str.to_s.split("")
  end
end
You could simplify it even more using map, but it really only becomes necessary when you have multiple elements, as the two-element case is perfectly legible as-is. However, here goes:
class Hamming
  def compute(a, b)
    [a, b].map { |x| list_chars(x) }
  end
  private def list_chars(str)
    str.to_s.split("")
  end
end
Also, you might want to see the method #each_char, giving you an iterator, which is more readable, and often the more correct choice, than .split("").
EDIT: After thinking about it a bit, it seems like you're starting a method to evaluate the Hamming distance between two strings, and that you do not intend that method to simply return the characters of the two strings. In that case, I'd just write this:
def compute(a, b)
  a_chars = a.to_s.each_char
  b_chars = b.to_s.each_char
  # ...
end
or possibly this, if you absolutely need to have characters themselves, and not an iterator:
def compute(a, b)
  a_chars = a.to_s.each_char.to_a
  b_chars = b.to_s.each_char.to_a
  # ...
end
The solution I believe you are looking for would look like this:
def compute(a, b)
  a_chars, b_chars = *[a, b].map { |x| x.to_s.each_char.to_a }
  # ...
end
but I'd consider that less readable than the non-DRY one; if you really want to DRY it up, extract the listification into its own function as described above, and just do
a_chars = list_chars(a)
b_chars = list_chars(b)
which is actually the best of both worlds, even if it is a bit of an overkill in this case: it is DRY-ly maintainable and self-documentingly legible, for a bit of tradeoff in verbosity.
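For completeness, here is one way the elided "# ..." part might be filled in if the goal really is a Hamming distance. This is a sketch under that assumption; the method name hamming_distance and the decision to raise on unequal lengths are choices made here, not something the question specified.

```ruby
# Hypothetical completion: counts the positions at which the two inputs differ.
def hamming_distance(a, b)
  a_chars = a.to_s.each_char.to_a
  b_chars = b.to_s.each_char.to_a
  unless a_chars.length == b_chars.length
    raise ArgumentError, "inputs must have the same length"
  end
  # zip pairs up characters by position; count tallies the mismatching pairs
  a_chars.zip(b_chars).count { |x, y| x != y }
end

puts hamming_distance("karolin", "kathrin") # => 3
```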
Since the code doesn't make sense, I think what you're asking is how do you avoid repeating yourself.
Simple: write another method and call that. Here's an example where you want to find out which phrase is longer, but you want to ignore runs of whitespace, so "foo      bar   " isn't longer than "12345678".
def longer_phrase(phraseA, phraseB)
  normalizedA = normalize(phraseA)
  normalizedB = normalize(phraseB)
  return normalizedA.length > normalizedB.length ? phraseA : phraseB
end
def normalize(phrase)
  normalized = phrase.gsub(/\s+/, ' ')
  normalized.strip!
  return normalized
end
puts longer_phrase("foo      bar   ", "12345678")
Needing to normalize all your data before doing work on it comes up a lot. This avoids repeating yourself. It makes your code easier to understand, since we know what the point of all that work is, to normalize the string. And it gives you a normalization function to use elsewhere so you're normalizing your data the same way.

Not displaying its corresponding values with its key for Hash

OK, I am not here to ask for an answer. But to be honest, I am not really good with class variables, so I would appreciate it if you could guide me through this piece of code.
I have read about class variables in the docs and I somewhat understand them. But when it comes to applying them for my own use, I get confused.
class Square
  @@sqArray = {}
  #attr_accessor :length
  def initialize
    if defined?(@@length)
      randno = "%s" % [rand(20)]
      @@length = randno.to_i
      @@sqArray = @@length
    else
      randno = "%s" % [rand(20)]
      @@length = randno.to_i
      @@sqArray = @@length
    end
  end
  def Area
    @@area = @@length * @@length
    return @@area
    @@sqArray[@@length.to_sym] = @@area
    puts @@sqArray
  end
end
s1 = Square.new
puts s1.Area
Let me explain this piece of code. Every time I create a Square object, it goes through the initialize method. A random number is generated and assigned to @@length, and @@length is meant to be stored in the hash @@sqArray as its key. When I create a new object s1 and display the area, I want to print the hash @@sqArray with the length as its key and the area as its value. The problem is that only the area is returned, e.g. just 114.
It is supposed to be e.g. [ 24 => 114]
When defining an object's property (i.e. its length), the correct approach is to use an instance variable, not a class variable. This is because, in your particular example, length is an attribute of a specific square and not something that applies to all squares. Your code should look something like this:
class Square
  def initialize(length = rand(20))
    @length = length
  end
  def area
    @length * @length
  end
end
s1 = Square.new
puts s1.area
Now, I am a little unclear what exactly you aim to achieve with the class variable @@sqArray, but you could, for example, use one to store a list of all defined Squares:
class Square
  @@squares_list = []
  def self.all_known
    @@squares_list
  end
  def initialize(length = rand(20))
    @length = length
    @@squares_list << self
  end
  def area
    @length * @length
  end
end
This would allow you to write code like:
s1 = Square.new #=> #<Square:0x0000000132dbc8 @length=9>
s2 = Square.new(20) #=> #<Square:0x000000012a1038 @length=20>
s1.area #=> 81
s2.area #=> 400
Square.all_known #=> [#<Square:0x0000000132dbc8 @length=9>, #<Square:0x000000012a1038 @length=20>]
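If the goal was the length-to-area hash from the question, note that in the question's Area method the return statement runs before the hash is updated, so the last two lines never execute. A minimal sketch of one way to record the pair before returning follows; the class name TrackedSquare, the areas reader and the use of a class-level instance variable are all assumptions made for illustration.

```ruby
class TrackedSquare
  # Class-level instance variable (hypothetical) holding length => area pairs.
  @areas = {}

  class << self
    attr_reader :areas
  end

  attr_reader :length

  def initialize(length = rand(1..20))
    @length = length
  end

  def area
    result = @length * @length
    self.class.areas[@length] = result # record the pair before returning
    result
  end
end

s1 = TrackedSquare.new(24)
puts s1.area # => 576
p TrackedSquare.areas
```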
Class variables have some odd behaviour and limited use cases however; I would generally advise that you avoid them when starting out learning Ruby. Have a read through a ruby style guide to see some common conventions regarding best practice - including variable/method naming (use snake_case not camelCase or PascalCase), whitespace, etc.
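One example of that odd behaviour, as a small sketch with made-up class names: a class variable is a single slot shared by a class and all of its subclasses, whereas an instance variable set on the class object itself belongs to that one class only.

```ruby
class Shape
  @@count = 0 # class variable: shared by Shape and every subclass

  def self.count
    @@count
  end

  def self.count=(value)
    @@count = value
  end
end

class Circle < Shape
end

Circle.count = 5
puts Shape.count # => 5, the subclass changed the parent's value

class Counter
  @count = 0 # instance variable of the Counter class object itself

  class << self
    attr_accessor :count
  end
end

class SubCounter < Counter
end

SubCounter.count = 5
puts Counter.count # => 0, each class keeps its own value
```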

Dynamic methods using define_method and eval

I've put together two sample classes, implemented in a couple of different ways, which pretty well mirror what I want to do in my Rails model. My concern is that I don't know what, if any, are the risks of using either approach. I've only found posts that explain how to implement them, or that give a general warning to avoid them or be careful when using them. What I have not found is a clear explanation of how to accomplish this safely, what I should be careful of, or why I should avoid this pattern.
class X
  attr_accessor :yn_sc, :um_sc
  def initialize
    @yn_sc = 0
    @um_sc = 0
  end
  types = %w(yn um)
  types.each do |t|
    define_method("#{t}_add") do |val|
      val = ActiveRecord::Base.send(:sanitize_sql_array, ["%s", val])
      eval("@#{t}_sc += #{val}")
    end
  end
end
class X
  attr_accessor :yn_sc, :um_sc
  def initialize
    @yn_sc = 0
    @um_sc = 0
  end
  types = %w(yn um)
  types.each do |t|
    # eval <<-EVAL also works
    self.class_eval <<-EVAL
      def #{t}_add(val)
        @#{t}_sc += val
      end
    EVAL
  end
end
x = X.new
x.yn_add(1) #=> x.yn_sc == 1 for both
Well, your code looks really safe. But imagine code based on user input. It might look something like
puts 'Give me an order, sir!'
order = gets.chomp
eval(order)
What will happen if our captain goes wild and orders us to run something like system('rm -rf ~/')? Sad things for sure!
So take a little lesson: eval is not safe because it evaluates every string it receives.
But there's another reason not to use eval. Sometimes it evaluates slower than alternatives. Look here if interested.
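For the classes in the question, the same dynamic methods can be written without eval at all. The following is a sketch, not the only way to do it: the class name Scoreboard is made up, and converting the argument with Integer() is an assumption about what counts as valid input.

```ruby
class Scoreboard
  attr_accessor :yn_sc, :um_sc

  def initialize
    @yn_sc = 0
    @um_sc = 0
  end

  %w[yn um].each do |t|
    define_method("#{t}_add") do |val|
      ivar = "@#{t}_sc"
      # Integer() rejects strings that are not integer literals, so no
      # caller-supplied text is ever evaluated as code.
      instance_variable_set(ivar, instance_variable_get(ivar) + Integer(val))
    end
  end
end

x = Scoreboard.new
x.yn_add(1)
x.yn_add("2")
puts x.yn_sc # => 3
```

Because the only interpolated value is the loop variable t, which comes from a literal list in the class body, nothing supplied by a caller ends up in a method or variable name.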

Ruby next multiple

Is there another way to write 'a'.next.next? I've looked all over and can't seem to find it.
I've tried multiplying the .next but I keep getting errors.
Well, this might not be a good idea in the case here, but if you're looking to chain a method n times in general, you can do something like this:
2.times.inject('a') { |s| s.next }
# => 'c'
20.times.inject('a') { |s| s.next }
# => 'u'
This starts with the value 'a', runs a block that calls next, then each successive result is fed back into the block.
For what it's worth, monkey-patching String can be fine for trivial scripts, but personally I'd try to look for other solutions first, like just adding a utility function to your class/module:
def repeat_next(str, n = 1)
  n.times.inject(str) { |s| s.next }
end
A shortcut for your specific problem, ('a'.ord + 2).chr, potentially exists, although it's not the same thing.
You can just redefine String#next like this:
class String
  alias_method :next1, :next
  def next(n = 1)
    str = self
    n.times do
      str = str.next1
    end
    str
  end
end
puts 'a'.next
puts 'a'.next(2)
puts 'a'.next(20)
If you're looking for a more succinct way of doing this, you could use ('a'.ord + 2).chr. This converts 'a' to its numerical representation (with the "ord" method), increments it by two, then converts it back to a character (with "chr").
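The two approaches agree in the middle of the alphabet but diverge at the end of it, which is worth keeping in mind when choosing between them. A small sketch of the difference (the variable names are just for illustration):

```ruby
# In the middle of the alphabet both give the same answer.
puts 'a'.next.next         # => c
puts(('a'.ord + 2).chr)    # => c

# At the end of the alphabet they diverge.
by_next = 'z'.next          # String#next carries over, like an odometer
by_ord  = ('z'.ord + 1).chr # plain code point arithmetic
p by_next # => "aa"
p by_ord  # => "{"
```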
You can monkey-patch the String class in ruby to add a method to do this for you:
class String
  def get_nth_char(n)
    current = self
    while n > 0 do
      current = current.next
      n = n - 1
    end
    current
  end
end
So you can do 'a'.get_nth_char(2) # => 'c'

Is adding nowiki-tags to this parser feasible?

Update: for the record, here's the implementation I ended up using.
Here's a trimmed down version of a parser I'm working on. There's still some code, but it should be quite easy to grasp the basic concepts of this parser.
class Markup
  def initialize(markup)
    @markup = markup
  end
  def to_html
    @html ||= @markup.split(/(\r\n){2,}|\n{2,}/).map {|p| Paragraph.new(p).to_html }.join("\n")
  end
  class Paragraph
    def initialize(paragraph)
      @p = paragraph
    end
    def to_html
      @p.gsub!(/'{3}([^']+)'{3}/, "<strong>\\1</strong>")
      @p.gsub!(/'{2}([^']+)'{2}/, "<em>\\1</em>")
      @p.gsub!(/`([^`]+)`/, "<code>\\1</code>")
      case @p
      when /^=/
        level = (@p.count("=") / 2) + 1 # Starting on h2
        @p.gsub!(/^[= ]+|[= ]+$/, "")
        "<h#{level}>" + @p + "</h#{level}>"
      when /^(\*|\#)/
        # I'm parsing lists here. Quite a lot of code, and not relevant, so
        # I'm leaving it out.
      else
        @p.gsub!("\n", "\n<br/>")
        "<p>" + @p + "</p>"
      end
    end
  end
end
p Markup.new("Here is `code` and ''emphasis'' and '''bold'''!

Baz").to_html
# => "<p>Here is <code>code</code> and <em>emphasis</em> and <strong>bold</strong>!</p>\n<p>Baz</p>"
So, as you can see, I'm breaking the text into paragraphs, and each paragraph is either a header, a list or a regular paragraph.
Is it feasible to add support for nowiki tags (where everything between <nowiki></nowiki> is not being parsed) for a parser like this? Feel free to answer "no", and suggest alternative methods of creating a parser :)
As a sidenote, you can see the actual parser code on Github. markup.rb and paragraph.rb
If you make use of a simple tokenizer, it's much easier to manage this sort of thing. One approach is to create a single regular expression that can capture your entire grammar, but this might prove to be problematic. An alternative is to split up the document into sections that need to be rewritten, and sections that should be skipped, which is likely the easier approach here.
Here's a simple framework you can extend as required:
def wiki_subst(string)
  buffer = string.dup
  result = ''
  while (m = buffer.match(/<\s*nowiki\s*>.*?<\s*\/\s*nowiki\s*>/i))
    result << yield(m.pre_match)
    result << m.to_s
    buffer = m.post_match
  end
  result << yield(buffer)
  result
end
example = "replace me<nowiki>but not me</nowiki>replace me too<NOWIKI>but not me either</nowiki>and me"
puts wiki_subst(example) { |s| s.upcase }
# => REPLACE ME<nowiki>but not me</nowiki>REPLACE ME TOO<NOWIKI>but not me either</nowiki>AND ME
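To connect this back to the parser in the question, the block passed to wiki_subst could apply the inline markup rules, so that only text outside the nowiki sections is rewritten. The sketch below repeats wiki_subst so it runs on its own; the inline_markup helper is a hypothetical name that simply reuses the three gsub rules from the question's Paragraph class.

```ruby
def wiki_subst(string)
  buffer = string.dup
  result = +''
  while (m = buffer.match(/<\s*nowiki\s*>.*?<\s*\/\s*nowiki\s*>/i))
    result << yield(m.pre_match) # rewrite the text before the nowiki section
    result << m.to_s             # copy the nowiki section through untouched
    buffer = m.post_match
  end
  result << yield(buffer)
  result
end

# Hypothetical helper reusing the inline rules from the question.
def inline_markup(text)
  text.gsub(/'{3}([^']+)'{3}/, '<strong>\1</strong>')
      .gsub(/'{2}([^']+)'{2}/, '<em>\1</em>')
      .gsub(/`([^`]+)`/, '<code>\1</code>')
end

html = wiki_subst("''a'' <nowiki>''b''</nowiki> '''c'''") { |s| inline_markup(s) }
puts html
# => <em>a</em> <nowiki>''b''</nowiki> <strong>c</strong>
```

The nowiki tags themselves are left in the output here, as in the framework above; stripping them would be a separate step.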
