Which style of Ruby string quoting do you favour? - ruby

Which style of Ruby string quoting do you favour? Up until now I've always used 'single quotes' unless the string contains certain escape sequences or interpolation, in which case I obviously have to use "double quotes".
However, is there really any reason not to just use double quoted strings everywhere?

Don't use double quotes if you have to escape them. And don't fall into the "single vs double quotes" trap. Ruby has excellent support for arbitrary delimiters for string literals:
Mirror of Site - https://web.archive.org/web/20160310224440/http://rors.org/2008/10/26/dont-escape-in-strings
Original Site -
http://rors.org/2008/10/26/dont-escape-in-strings
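As a minimal sketch of the arbitrary-delimiter idea (the variable names and sample text are made up for illustration), the same string can be written with and without backslash escapes:

```ruby
name = 'Ruby'

# Double quotes force the embedded double quotes to be escaped.
escaped   = "She said \"hello\" to #{name}"
# %Q behaves like double quotes (interpolation works), with a delimiter of your choice.
percent_q = %Q{She said "hello" to #{name}}
# %q behaves like single quotes: no interpolation, so the #{name} text stays literal.
literal   = %q{She said "hello" to #{name}}

puts escaped == percent_q
puts percent_q
puts literal
```

The braces are only one choice of delimiter; brackets, parentheses, or pipes work the same way, so you can pick whichever does not occur in the text.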

I always use single quotes unless I need interpolation.
Why? It looks nicer. When you have a ton of stuff on the screen, lots of single quotes give you less "visual clutter" than lots of double quotes.
I'd like to note that this isn't something I deliberately decided to do, just something that I've 'evolved' over time in trying to achieve nicer looking code.
Occasionally I'll use %q or %Q if I need in-line quotes. I've only ever used heredocs maybe once or twice.

Like many programmers, I try to be as specific as is practical. This means that I try to make the compiler do as little work as possible by having my code as simple as possible. So for strings, I use the simplest method that suffices for my needs for that string.
<<END
For strings containing multiple newlines,
particularly when the string is going to
be output to the screen (and thus formatting
matters), I use heredocs.
END
%q[Because I strongly dislike backslash quoting when unnecessary, I use %Q or %q
for strings containing ' or " characters (usually with square braces, because they
happen to be the easiest to type and least likely to appear in the text inside).]
"For strings needing interpretation, I use %s."%['double quotes']
'For the most common case, needing none of the above, I use single quotes.'
My first simple test of the quality of syntax highlighting provided by a program is to see how well it handles all methods of quoting.

I use single quotes unless I need interpolation. The argument about it being troublesome to change later when you need interpolation swings in the other direction, too: you have to change from double to single quotes when you find that a # or a \ in your string caused an escape you didn't intend.
The advantage of defaulting to single quotes is that, in a codebase which adopts this convention, the quote type acts as a visual cue as to whether to expect interpolated expressions or not. This is even more pronounced when your editor or IDE highlights the two string types differently.
I use %{.....} syntax for multi-line strings.
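To illustrate the unintended-escape point, here is a small sketch (the path and counter are invented examples) showing how the same source text means different things in the two quote styles:

```ruby
# Single quotes: backslashes followed by n or t stay as two literal characters.
single_path = 'C:\new\table'
# Double quotes: \n becomes a newline and \t becomes a tab.
double_path = "C:\new\table"

count = 3
single_msg = 'Total: #{count}'   # no interpolation, the text is kept verbatim
double_msg = "Total: #{count}"   # interpolated

puts single_path.length
puts double_path.length
puts single_msg
puts double_msg
```

The two paths differ in length because each escape sequence in the double-quoted version collapses into a single character.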

I usually use double quotes unless I specifically need to disable escaping/interpolation.

I see arguments for both:
For using mostly double quotes:
The GitHub Ruby style guide advocates always using double quotes.
It's easier to search for a string foobar by searching for "foobar" if you were consistent with quoting. However, I'm not. So I search for ['"]foobar['"] with regexps turned on.
For using some combination of single and double quotes:
Know if you need to look for string interpolation.
Might be slightly faster (although so slight it wasn't enough to affect the github style guide).

I used to use single quotes until I knew I needed interpolation. Then I found that I was wasting a lot of time when I'd go back and have to change some single-quotes to double-quotes. Performance testing showed no measurable speed impact of using double-quotes, so I advocate always using double-quotes.
The only exception is when using sub/gsub with back-references in the replacement string. Then you should use single quotes, since it's simpler.
mystring.gsub( /(fo+)bar/, '\1baz' )
mystring.gsub( /(fo+)bar/, "\\1baz" )
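A quick sketch of why the single-quoted form is simpler here (the sample value of mystring is an assumption for illustration): both lines above produce the same result, while forgetting to double the backslash inside double quotes silently produces something else.

```ruby
mystring = 'foobar fooobar'

single = mystring.gsub(/(fo+)bar/, '\1baz')    # \1 reaches gsub as a back-reference
double = mystring.gsub(/(fo+)bar/, "\\1baz")   # same replacement, backslash doubled

# In double quotes, "\1" is an octal escape for the character U+0001,
# so gsub never sees a back-reference at all.
wrong = mystring.gsub(/(fo+)bar/, "\1baz")

puts single
puts double
p wrong
```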

I use single quotes unless I need interpolation, or the string contains single quotes.
However, I just learned the arbitrary delimiter trick from Dejan's answer, and I think it's great. =)

Single quotes preserve the characters inside them, but double quotes parse and evaluate them. See the following example:
"Welcome #{@user.name} to App!"
Results:
Welcome Bhojendra to App!
But,
'Welcome #{@user.name} to App!'
Results:
Welcome #{@user.name} to App!

Related

Get the same results from string.start_with? and string[ ]

Basically, I want to check if a string (main) starts with another string (sub), using both of the above methods. For example, following is my code:
main = gets.chomp
sub = gets.chomp
p main.start_with? sub
p main[/^#{sub}/]
And, here is an example with I/O - Try it online!
If I enter simple strings, then both of them work exactly the same, but when I enter strings like "1\2" in stdin, then I get errors in the Regexp variant, as seen in the TIO example.
I guess this is because the string passed into the second one isn't raw. So, I tried passing sub.dump into the second one - Try it online!
which gives me nil result. How to do this correctly?
As a general rule, you should never ever blindly execute inputs from untrusted sources.
Interpolating untrusted input into a Regexp is not quite as bad as interpolating it into, say, Kernel#eval, because the worst thing an attacker can do with a Regexp is to construct an Evil Regex to conduct a Regular expression Denial of Service (ReDoS) attack (see also the section on Performance in the Regexp documentation), whereas with eval, they could execute arbitrary code, including but not limited to, deleting the entire file system, scanning memory for unencrypted passwords / credit card information / PII and exfiltrate that via the network, etc.
However, it is still a bad idea. For example, when I say "the worst thing that can happen is a ReDoS", that assumes that there are no bugs in the Regexp implementation (Onigmo in the case of YARV, Joni in the case of JRuby and TruffleRuby, etc.). Ruby's Regexps are quite powerful and thus Onigmo, Joni and co. are large and complex pieces of code, and may very well have their own security holes that could be used by a specially crafted Regexp.
You should properly sanitize and escape the user input before constructing the Regexp. Thankfully, the Ruby core library already contains a method which does exactly that: Regexp::escape. So, you could do something like this:
p main[/^#{Regexp.escape(sub)}/]
The reason why your attempt at using String#dump didn't work, is that String#dump is for representing a String the same way you would have to write it as a String literal, i.e. it is escaping String metacharacters, not Regexp metacharacters and it is including the quote characters around the String that you need to have it recognized as a String literal. You can easily see that when you simply try it out:
sub.dump
#=> "\"1\\\\2\""
# i.e. the six characters "1\\2", with the quote characters included
So, that means that String#dump
includes the quotes (which you don't want),
escapes characters that don't need escaping in Regexp just because they need escaping in Strings (e.g. # or "), and
doesn't escape characters that need escaping in a Regexp but not in Strings (e.g. [, ., ?, *, +, ^, -).
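Putting the pieces together, here is a minimal sketch of the two checks agreeing once the input is escaped. The literal values stand in for the two gets.chomp inputs, and \A (start of string) is used instead of ^ (start of any line), which is my own substitution rather than something the question asked for:

```ruby
main = '1\2 and more'   # hypothetical stand-in for the first line of input
sub  = '1\2'            # hypothetical stand-in for the second line of input

by_method = main.start_with?(sub)
# Regexp.escape neutralises the backslash, so it is matched literally.
by_regexp = main[/\A#{Regexp.escape(sub)}/]

# Without escaping, \2 is read as a back-reference to a group that does not exist.
# This is expected to raise RegexpError for this input, as reported in the question.
unescaped_error =
  begin
    main[/\A#{sub}/]
    nil
  rescue RegexpError => e
    e.class
  end

p by_method
p by_regexp
p !by_regexp.nil?
p unescaped_error
```

Note that String#[] with a Regexp returns the matched text or nil rather than true or false, so compare against nil if you need a boolean that lines up with start_with?.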

Semantic differences between percent literals and heredocs in Ruby?

Looking at some documentation, I saw a multiline string defined using a percent literal:
command %Q{
do this;
do that;
}
In the past, I've always used heredocs when I needed multiline strings:
command <<-heredoc
echo "stuff" | do stuff;
heredoc
What are the semantic differences between them? Is there any reason why I would want to use %Q and not a heredoc?
I tend to evaluate how much text is being used when deciding which to use.
I use %Q when there's not a lot of text (for example, a single line), e.g. %Q|foobar|. The value that %Q provides, is it allows you to easily mix quotes, e.g.
%Q|"Get a Job" ~Mom's words|
I use "heredoc"s when there is a lot of text that spans multiple lines.
For example, suppose you're pasting a lot of text into a REPL (like the content of a YAML file). Unless you traverse the whole file, you can't be certain whether or not you will have a conflict with whatever %Q separator you have chosen. With a "heredoc" you just use some really obscure piece of text that you're fairly certain will not have a conflict, e.g.
<<-BatMobilePrettyObscure
... Lots of text ...
BatMobilePrettyObscure
As far as I know, semantically, there are just a few small differences:
%Q can only use one character to delimit strings
%Q can be multi-line or single-line
"heredoc"s must be multi-line, with the closing "heredoc" standing alone
%Q delimiters can be "mashed" up against their strings, e.g. %Q|foobar|
There's a funky trick that you can use with heredocs: the first line can be used as if it was a complete string. For example, all of the following examples are valid Ruby code:
puts(<<-EOS)
Hello, world!
EOS
<<-EOS.upcase
Hello, world!
EOS
puts(<<-EOS.upcase)
Hello, world!
EOS
However, you will not find that very often in the wild. Other than that, they are the same as double quoted strings or %Q{} and %{} literals, except that you can choose multi-character delimiters. This comes in handy when all of the possible percent literal delimiters may occur in the string. This especially applies to long strings.
There isn't really a semantic difference, and it doesn't have to do with multiline strings either. All strings can be multiline in Ruby. These are all the same string:
'a
b
'
"a
b
"
%Q{a
b
}
<<-heredoc
a
b
heredoc
The question of which to use is decided by whether you need interpolation and the convenience of escaping characters. For example:
Do you need interpolation? If not then '' or %q()
Will there be lots of quote characters to escape? Then use %Q()
Do you want to write a lot of text without thinking about escaping characters? Use heredocs.
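A small sketch to back up the claim that the forms differ only in interpolation and escaping convenience (the variable names and the TEXT delimiter are arbitrary choices):

```ruby
single = 'a
b
'
percent = %Q{a
b
}
heredoc = <<-TEXT
a
b
TEXT

# All three literals build the same four-character string.
puts [single, percent, heredoc].uniq.length

n = 2
plain  = %q(n is #{n})   # like single quotes: no interpolation
interp = %Q(n is #{n})   # like double quotes: interpolated
doc = <<-TEXT
n is #{n}
TEXT

puts plain
puts interp
puts doc
```

A plain heredoc interpolates like a double-quoted string and keeps the trailing newline of its last line.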

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not much complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
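Here is a sketch of that capturing pattern used from Ruby, including the manual backslash removal. The constant name and the unescaping step are my own illustration, not part of the answer:

```ruby
# Opening backtick, then runs of ordinary characters interleaved with
# backslash-escaped characters, then the closing backtick.
INLINE_CODE = /`([^`\\]*(?:\\.[^`\\]*)*)`/

text = 'hello `how\` are` you'

captured  = text[INLINE_CODE, 1]
# Drop each escaping backslash and keep the character it protected.
unescaped = captured.gsub(/\\(.)/, '\1')

puts captured
puts unescaped

# With the closing backtick removed there is no match at all.
p 'hello `how\` are you'[INLINE_CODE]
```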
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
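For the triple-backtick variant, a brief Ruby sketch (the sample text and constant name are assumptions) shows a lone backtick inside the segment being kept as content rather than treated as a delimiter:

```ruby
# A single backtick is accepted as content unless it starts a run of three.
TRIPLE = /```([^`]*(?:`(?!``)[^`]*)*)```/

text = 'use ```a ` b``` here'

p text[TRIPLE, 1]
p 'no closing ```delimiter here'[TRIPLE, 1]
```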
Is this what you're after?
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right (do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important to decide which strategy works best for your case).

Using regexes in ruby with a need to match lots of * and /

I need to find strings with * and / using regexes, and I am writing in Ruby. The reason I need to find lots of * and / is that I am building a tokenizer for a language, and there are multi-line comments that use the C style of multi-line comments (/* */). I have the single-line comments handled already.
Is there a way to write a regex without using the two forward slashes as delimiters? I am finding it impossible to spot my mistakes due to the insane amount of escaping. Or can someone give me advice on how to handle the escaping in a sane manner? I already tried writing the sequence first and then escaping it.
Thank you for your time and advice.
One trick that might help is the %r literal:
%r{http://www\.google\.com}
I like to use pipes myself, when they're not in the regex.
%r|http://www\.google\.com|
You can also create new instances of Regexp via Regexp.new and pass a string.
Finally, you might also look at Regexp.quote (an alias of Regexp.escape):
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
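Applied to the C-style comments from the question, a rough sketch might look like this. The sample source string is invented, and a real tokenizer would also need to handle comment markers inside string literals, which this ignores:

```ruby
# With // delimiters every slash has to be escaped as well as every star.
with_slashes = /\/\*.*?\*\//m
# With %r{} only the stars, which are regex metacharacters, need a backslash.
with_percent = %r{/\*.*?\*/}m

source = "a = 1; /* first\n   comment */ b = a / 2; /* second */"

comments = source.scan(with_percent)
p comments
p source.scan(with_slashes) == comments
```

The m option lets . match newlines so a comment can span lines, and the lazy .*? stops at the first closing */.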

What does %{} do in Ruby?

In Matt's post about drying up cucumber tests, Aslak suggests the following.
When I have lots of quotes, I prefer this:
Given %{I enter “#{User.first.username}” in “username”}
What is the %{CONTENT} construct called? Will someone mind referencing it in some documentation? I'm not sure how to go about looking it up.
There's also the stuff about %Q. Is that equivalent to just %? What of the curly braces? Can you use square braces? Do they function differently?
Finally, what is the #{<ruby stuff to be evaluated>} construct called? Is there a reference to that in documentation somewhere, too?
None of the other answers actually answer the question.
This is percent sign notation. The percent sign indicates that the next character is a literal delimiter, and you can use any (non alphanumeric) one you want. For example:
%{stuff}
%[stuff]
%?stuff?
etc. This allows you to put double quotes, single quotes etc into the string without escaping:
%{foo='bar with embedded "baz"'}
returns the literal string:
foo='bar with embedded "baz"'
The percent sign can be followed by a letter modifier to determine how the string is interpolated. For example, %Q[ ] is an interpolated String, %q[ ] is a non-interpolated String, %i[ ] is a non-interpolated Array of Symbols etc. So for example:
%i#potato tuna#
returns this array of Symbols:
[:potato, :tuna]
Details are here: Wikibooks
"Percent literals" is usually a good way to google some information:
http://www.sampierson.com/articles/ruby-percent-literals
http://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_.25_Notation
#{} is called "string interpolation".
The #{1+1} is called String Interpolation.
I, and Wikibooks, refer to the % stuff as just "% notation". Reference here. The % notation takes any delimiter, so long as it's non alphanumeric. It can also take modifiers (kind of like how regular expressions take options), one of which, interestingly enough, is whether you'll permit #{}-style string interpolation (this is also enabled by default).
% then does some special stuff to it, giving that notation some distinct, if a bit cryptic to beginners, terseness. For example %w{hello world} returns an array ['hello','world']. %s{hello} returns a symbol :hello.
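The modifiers mentioned in these answers can be checked side by side in a short sketch (the user variable is a made-up example):

```ruby
user = 'alice'

bare    = %{Hi "#{user}"}    # bare % behaves like %Q: interpolated
upper_q = %Q[Hi "#{user}"]   # interpolated
lower_q = %q[Hi "#{user}"]   # not interpolated
words   = %w{hello world}    # array of strings
symbols = %i[potato tuna]    # array of symbols
symbol  = %s{hello}          # a single symbol

puts bare
puts lower_q
p words
p symbols
p symbol
```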

Resources