StringScanner is matching a string as though it was one position back - ruby

I'm trying to use a StringScanner to parse a string into tokens for processing later. All was going well until I tested the regex syntax parsing. Regexen look like this:
r|hello|gmi
r:there|there:gmi
r/:(?=[jedi])[sith]:/gmi
r!hello!gmi
That is, r, followed by | (or a couple of other characters, but that's irrelevant right now), followed by the body of the regex -- which can include escaped characters, like \| and \\ -- then another |, and then the flags of the regex.
To look for regex literals, I'm using code that looks an awful lot like this:
require 'strscan'
scanner = StringScanner.new('r|abc| ')
puts "pre-regex: #{scanner.inspect}"
puts "got a char: #{scanner.getch} (res: #{scanner.inspect})"
divider = scanner.getch
puts "got divider: #{divider.inspect}"
puts "mid-regex: #{scanner.inspect}"
# this bit still fails even if you replace `#{divider}' with `|'
res = scanner.scan_until(/(?<![^\\]\\)#{divider}[a-z]*/)
puts "post-regex: #{scanner.inspect}"
if scanner.skip(/\s+/)# || scanner.skip(/;-.*?-;/m)
puts "Success! #{res}"
else
puts "Fail. Ended at: #{scanner.inspect}"
puts "(res was #{res.inspect})"
end
Try it online at ideone
Here, I've trimmed it down as much as I think I can to show the problem clearly. In the real code, it's part of a much larger piece of code, the vast majority of which isn't relevant. I've narrowed down the bug to this part -- you can use the link to see that it's there -- but I can't figure out why this isn't correctly scanning until the next instance of |, then returning the flags.
As a side note, if there's a better way to do this, please let me know. I've found that I quite like StringScanner, but that might be because I'm obsessed with regexen, to the point that I call them regexen.
TL;DR: Why is StringScanner apparently matching as though its position was one character back, and how can I make it work right?

The problem is that Ruby interpolates the string into the regexp literal as is, so an unescaped | becomes the alternation operator instead of a literal character. For example:
divider = '|'
/(?<![^\\]\\)#{divider}[a-z]*/
=> /(?<![^\\]\\)|[a-z]*/
To escape the divider, pass it through Regexp.quote:
divider = '|'
/(?<![^\\]\\)#{Regexp.quote divider}[a-z]*/
=> /(?<![^\\]\\)\|[a-z]*/
With this modification the code passes, but you still need to verify that the divider is a non-word character.
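As a rough sketch of the fix in context, the scanning step could be wrapped in a small helper. The name scan_regex_literal and the [body, flags] return shape are assumptions made for illustration, not part of the original code, and the lookbehind is kept exactly as in the question, so it inherits that pattern's limitations.

```ruby
require 'strscan'

# Hypothetical helper: reads one r<divider>body<divider>flags literal from the
# start of `source` and returns [body, flags], or nil if none is found.
def scan_regex_literal(source)
  scanner = StringScanner.new(source)
  return nil unless scanner.getch == 'r'

  divider = scanner.getch
  # Reject a missing divider or a word character (letter, digit, underscore).
  return nil if divider.nil? || divider =~ /\w/

  # Regexp.quote turns "|" into "\|", so the divider is matched literally.
  closing = /(?<![^\\]\\)#{Regexp.quote(divider)}[a-z]*/
  consumed = scanner.scan_until(closing)
  return nil if consumed.nil?

  tail = scanner.matched                         # closing divider plus flags
  body = consumed[0, consumed.size - tail.size]  # text before the closing divider
  flags = tail[divider.size..-1]
  [body, flags]
end

p scan_regex_literal('r|abc|gmi ')
```

Returning the body and flags separately is only one possible design; the real tokenizer may prefer to keep using the scanner's position directly.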

Related

What is this regex replacing?

I have this line in a Ruby file loading program:
row_hash.map{|k,v| v.gsub!(/\A"|"\Z/, '').try(:strip!) if !v.nil? }
I remember adding it, though the reason escapes me. I know that \A and \Z are the start and end of a string, respectively.
I've written regexes intermittently for 15 years, but the "|" is what's really mystifying me.
It strips quotes from strings.
This regex suffers from leaning toothpick syndrome. We can ease that by using %r, balanced delimiters, and extended formatting to ignore whitespace.
%r{ \A" | "\Z }x
It matches a quote at the beginning of the string, or one at the end (or just before a newline).
So looking at it all together...
v.gsub!( %r{ \A" | "\Z }x, '' ).try(:strip) if !v.nil?
The gsub! will apply the match until it doesn't match anymore. So it will match quotes at the beginning and end of v and replace them with nothing, all in place to v. The end result is v is stripped of beginning and ending quotes.
Then there's the blah.try(:strip). That's a Rails extension which is roughly equivalent to...
blah.strip if blah
Since gsub! returns nil if the match fails, v will be stripped only if it was in quotes. The strip happens after the quotes have been removed, and only if there were quotes. I suspect this is not the intended behavior.
However, strip doesn't alter v in place so probably does nothing unless you're using the return value of map which would make this even more complicated. You probably want try(:strip!).
Finally if !v.nil? means all that will only happen if v wasn't nil. Putting it at the end of an already complicated statement makes things even harder to understand.
This is a bit over-complicated as one line. It would be better if the nil check was done separately and the whole thing was properly spaced out. I've also decided to use an if condition instead of try, to make it more obvious that the stripping only happens if the gsub matches. I don't think that's the desired behavior, and I want it to be really obvious to anyone reading it.
row_hash.map { |_,v|
  next if v.nil?
  if v.gsub!( %r{ \A" | "\Z }x, '' )
    v.strip!
  end
}
Finally, since the behavior is really specific and finicky (and probably subtly wrong) the inner portion should be turned into a method so it can be named, documented and tested.
row_hash.map { |_,v| v.strip_quotes! }
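One way such a named method might look is sketched below. It is written as a plain function taking the string as an argument rather than a String extension, which is an assumption on my part, and it deliberately keeps the finicky behavior described above (strip only when a quote was removed).

```ruby
# Hypothetical helper named after the strip_quotes! call above.
# Removes a leading and/or trailing double quote in place and, only when a
# quote was actually removed, strips surrounding whitespace as well.
def strip_quotes!(value)
  return value if value.nil?

  # gsub! returns nil when nothing matched, so strip! runs only after a removal.
  value.strip! if value.gsub!(/\A"|"\Z/, '')
  value
end

row_hash = { name: '"  Alice "'.dup, note: '  unquoted  '.dup, empty: nil }
row_hash.each_value { |v| strip_quotes!(v) }
p row_hash
```

Because it has a name, the method can be unit-tested for the cases that are easy to get wrong: nil, unquoted values, and values quoted on one side only.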
It replaces the quote character at the start and end of the string. It ignores other occurrences of the character. Here's a sample of how the regex works.
http://rubular.com/r/pVMbQ9aqSl
"|" does not mean that the pipe is quoted. It basically matches \A" (start of the string followed by " ) or "\Z ( " followed by end of the string)
Let me know if this helps.

Generating a character class

I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of censoring some guessed percentage sign.
You have two options. One is to skip the substitution when guesses is empty, that is:
unless (guesses.empty?)
  word.gsub(/[^#{Regexp.escape(guesses)}]/i, '-')
end
Although that's not your intention, it's really the safest plan here and is the most clear in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
Generally tr performs better than gsub if used frequently. It also doesn't require any special escaping.
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
You should probably keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then it would be as easy as:
word.tr(remains.upcase + remains.downcase, "-")
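For comparison, here is a minimal sketch that avoids building a character class at all, so an empty guesses string needs no special case. The method name censor is an assumption for illustration; it is not taken from any of the answers.

```ruby
# Hypothetical helper: replaces every character of `word` that has not been
# guessed with "-". Comparison is case-insensitive, like the /i flag above.
def censor(word, guesses)
  shown = guesses.downcase
  word.each_char.map { |ch| shown.include?(ch.downcase) ? ch : '-' }.join
end

puts censor('Hangman', 'an')
puts censor('Hangman', '')
```

This trades the single regex for an explicit loop, which is slower on long strings but has no escaping or empty-class pitfalls.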

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby, but it still returns all the letters in mystr (moody) instead of only the letter m.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string, use s.sub!(/[a-z]/i,&:upcase).
Just for completeness: although this doesn’t directly answer your question as posed, it could be relevant. Consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
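This will match the character in question, but in Ruby versions before 2.4 it won’t change it. This is due to the second problem, which is that before Ruby 2.4 upcase (and downcase) only worked on characters in the ASCII range; from 2.4 onward they handle Unicode letters and the line above is enough. On older versions this is almost as easy to fix, but it relies on using an external library such as unicode_utils: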
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
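A small sketch of the Unicode-aware variant without an external library follows. It assumes Ruby 2.4 or later, where String#upcase maps non-ASCII letters; the helper name capitalize_first_letter is made up for this example.

```ruby
# Hypothetical helper. Assumes Ruby 2.4+, where upcase handles letters such as é.
def capitalize_first_letter(text)
  # \p{L} matches any Unicode letter; sub replaces only the first match.
  text.sub(/\p{L}/) { |letter| letter.upcase }
end

puts capitalize_first_letter("1. \u00e9cole")
puts capitalize_first_letter("1. moody")
```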
Your attempt returns all the letters because scan does just that: it returns every substring that matches the regex, which in your case is every letter. For your use case you should use sub, since you only want to substitute one letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, and then asks for one letter only with [a-z]
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub /\b[a-z]/, &:upcase
=> "1. Moody"
Hope that helps.
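To make the scan versus sub difference concrete, here is a short side-by-side sketch. The two method names are placeholders chosen for this illustration.

```ruby
# scan collects every match; with a one-letter pattern that is every letter.
def letters_found(text)
  text.scan(/[a-zA-Z]/)
end

# sub rewrites only the first match and returns the whole string.
def first_letter_upcased(text)
  text.sub(/[a-zA-Z]/, &:upcase)
end

p letters_found("1. moody")
p first_letter_upcased("1. moody")
```

Note that puts prints each element of an array on its own line, which is why the scan result in the question looked like the whole word rather than a single match.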

Variable Declaration Regex

I'm trying to make a simple Ruby regex to detect a JavaScript Declaration, but it fails.
Regex:
lines.each do |line|
unminifiedvar = /var [0-9a-zA-Z] = [0-9];/.match(line)
next if unminifiedvar == nil #no variable declarations on the line
#...
end
Testing Line:
var testvariable10 = 9;
A variable name can have more than one character, so you need a + after the character-set [...]. (Also, JS variable names can contain other characters besides alphanumerics.) A numeric literal can have more than one character, so you want a + on the RHS too.
More importantly, though, there are lots of other bits of flexibility that you'll find more painful to process with a regular expression. For instance, consider var x = 1+2+3; or var myString = "foo bar baz";. A variable declaration may span several lines. It need not end with a semicolon. It may have comments in the middle of it. And so on. Regular expressions are not really the right tool for this job.
Of course, it may happen that you're parsing code from a particular source with a very special structure and can guarantee that every declaration has the particular form you're looking for. In that case, go ahead, but if there's any danger that the nature of the code you're processing might change then you're going to be facing a painful problem that really isn't designed to be solved with regular expressions.
[EDITED about a day after writing, to fix a mistake kindly pointed out by "the Tin Man".]
You forgot the +, as in, more than one character for the variable name.
var [0-9a-zA-Z]+ = [0-9];
You may also want to add a + after the [0-9]. That way it can match multiple digits.
var [0-9a-zA-Z]+ = [0-9]+;
http://rubular.com/r/kPlNcGRaHA
Try /var [0-9a-zA-Z]+ = \d+;/
Without the +, [0-9a-zA-Z] will only match a single alphanumeric character. With +, it can match 1 or more alphanumeric characters.
By the way, to make it more robust, you may want to make it match any number of spaces between the tokens, not just exactly one space each. You may also want to make the semicolon at the end optional (because JavaScript syntax doesn't require a semicolon). You might also want to make it always match against the whole line, not just a part of the line. That would be:
/\Avar\s+[0-9a-zA-Z]+\s*=\s*\d+;?\Z/
(There is a way to write [0-9a-zA-Z] more concisely, but it has slipped my memory; if someone else knows, feel free to edit this answer.)
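Pulling the suggestions from these answers together, a sketch with named captures might look like this. The constant and method names are assumptions, the identifier pattern (letters, digits, _ and $) is a simplification of JavaScript's real rules, and it still only handles the narrow "var name = integer" form discussed here.

```ruby
# Hypothetical pattern: whole-line match, flexible spacing, optional semicolon.
DECLARATION = /\Avar\s+(?<name>[A-Za-z_$][A-Za-z0-9_$]*)\s*=\s*(?<value>\d+);?\Z/

# Returns [name, integer_value] for a simple declaration, or nil otherwise.
def parse_declaration(line)
  match = DECLARATION.match(line)
  match && [match[:name], match[:value].to_i]
end

p parse_declaration('var testvariable10 = 9;')
p parse_declaration('var x = 1+2+3;')
```

The second call returns nil, which illustrates the earlier point that anything beyond the simplest declarations falls outside what a single regex comfortably handles.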

Ruby -- looking for some sort of "Regexp unescape" method

I have a bunch of strings with special escape codes that I want to store unescaped. For example, the interpreter shows
"\\014\"\\000\"\\016smoothing\"\\011mean\"\\022color\"\\011zero#\\016"
but I want it to show (when inspected) as
"\014\"\000\"\016smoothing\"\011mean\"\022color\"\011zero#\016"
What's the method to unescape them? I imagine that I could make a regex to remove 1 backslash from every consecutive n backslashes, but I don't have a lot of regex experience and it seems there ought to be a "more elegant" way to do it.
For example, when I puts MyString it displays the output I'd like, but I don't know how I might capture that into a variable.
Thanks!
Edited to add context: I have this class that is being used to marshal / restore some stuff, but when I restore some old strings it spits out a type error which I've determined is because they weren't -- for some inexplicable reason -- stored as base64. They instead appear to have just been escaped, which I don't want, because trying to restore them similarly gives the TypeError
TypeError: incompatible marshal file format (can't be read)
format version 4.8 required; 92.48 given
because Marshal looks at the first characters of the string to determine the format.
require 'base64'
class MarshaledStuff < ActiveRecord::Base
validates_presence_of :marshaled_obj
def contents
obj = self.marshaled_obj
return Marshal.restore(Base64.decode64(obj))
end
def contents=(newcontents)
self.marshaled_obj = Base64.encode64(Marshal.dump(newcontents))
end
end
Edit 2: Changed wording -- I was thinking they were "double-escaped" but it was only single-escaped. Whoops!
If your strings give you the correct output when you print them then they are already escaped correctly. The extra backslashes you see are probably because you are displaying them in the interactive interpreter which adds extra backslashes for you when you display variables to make them less ambiguous.
> x
=> "\\"
> puts x
\
=> nil
> x.length
=> 1
Note that even though it looks like x contains two backslashes, the length of the string is one. The extra backslash is added by the interpreter and is not really part of the string.
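The same point can be checked in a script rather than in the interactive interpreter. The helper name compare_views below is hypothetical and exists only to put the two views next to each other.

```ruby
# Hypothetical helper: contrasts what a string holds with what inspect shows.
def compare_views(str)
  { chars: str.length, inspected: str.inspect }
end

x = "\\"            # a single backslash, written with an escape in source code
views = compare_views(x)
puts views[:chars]      # number of characters actually stored
puts views[:inspected]  # the quoted, escaped form the interpreter echoes
puts x                  # the raw contents
```

In other words, the variable already holds what puts displays; only the inspect form adds quotes and backslashes.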
If you still think there's a problem, please be more specific about how you are displaying the strings that you mentioned in your question.
Edit: In your example the only things that need unescaping are octal escape codes. You could try this (the first digit ranges over 0-3 so that every byte value up to \377 is covered):
x = x.gsub(/\\[0-3][0-7]{2}/){ |c| c[1,3].to_i(8).chr }
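A slightly more defensive sketch of the same idea is shown below. The method name unescape_octal is an assumption, and working on a binary copy of the string (String#b) is my own choice so that bytes above 127 do not run into encoding conflicts; it assumes the only escapes present are three-digit octal codes.

```ruby
# Hypothetical helper: turns literal "\NNN" octal sequences back into bytes.
def unescape_octal(str)
  # str.b is a binary (ASCII-8BIT) copy, so any byte value can be inserted.
  str.b.gsub(/\\([0-3][0-7]{2})/) { $1.to_i(8).chr }
end

escaped = '\\014"\\000"\\016smoothing'   # single quotes: each \\ is one backslash
p unescape_octal(escaped).bytes.first(5)
```

Whether the result can then be passed to Marshal.restore depends on how the strings were escaped originally, so it is worth checking a known value first.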

Resources