Why doesn't URI.escape escape single quotes? - ruby

Why doesn't URI.escape escape single quotes?
URI.escape("foo'bar\" baz")
=> "foo'bar%22%20baz"

For the same reason it doesn't escape ? or / or :, and so forth. URI.escape() only escapes characters that cannot be used in URLs at all, not characters that have a special meaning.
What you're looking for is CGI.escape():
require "cgi"
CGI.escape("foo'bar\" baz")
=> "foo%27bar%22+baz"

This is an old question, but the answer hasn't been updated in a long time. I thought I'd update this for others who are having the same problem. The solution I found was posted here: use ERB::Util.url_encode if you have the erb module available. This took care of single quotes & * for me as well.
CGI::escape doesn't escape spaces correctly (%20) versus plus signs.

According to the docs, URI.escape(str [, unsafe]) uses a regexp that matches all symbols that must be replaced with codes. By default the method uses REGEXP::UNSAFE. When this argument is a String, it represents a character set.
In your case, to modify URI.escape to escape even the single quotes you can do something like this ...
reserved_characters = /[^a-zA-Z0-9\-\.\_\~]/
URI.escape(YOUR_STRING, reserved_characters)
Explanation: Some info on the spec ...
All parameter names and values are escaped using the [rfc3986]
percent- encoding (%xx) mechanism. Characters not in the unreserved
character set ([rfc3986] section 2.3) must be encoded. characters in
the unreserved character set must not be encoded. hexadecimal
characters in encodings must be upper case. text names and values must
be encoded as utf-8 octets before percent-encoding them per [rfc3629].

I know this has been answered, but what I wanted was something slightly different, and I thought I might as well post it up: I wanted to keep the "/" in the url, but escape all the other non-standard characters. I did it thus:
#public filename is a *nix filepath,
#like `"/images/isn't/this a /horrible filepath/hello.png"`
public_filename.split("/").collect{|s| ERB::Util.url_encode(s)}.join("/")
=> "/images/isn%27t/this%20a%20/horrible%20filepath/hello.png"
I needed to escape the single quote as I was writing a cache invalidation for AWS Cloudfront, which didn't like the single quotes and expected them to be escaped. The above should make a uri which is more safe than the standard URI.escape but which still looks like a URI (CGI Escape breaks the uri format by escaping "/").

Related

Single backslash for Ruby gsub replacement value?

Does anyone know how to provide a single backslash as the replacement value in Ruby's gsub method? I thought using double backslashes for the replacement value would result in a single backslash but it results in two backslashes.
Example: "a\b".gsub("\", "\\")
Result: a\\b
I also get the same result using a block:
Example: "a\b".gsub("\"){"\\"}
Result: a\\b
Obviously I can't use a single backslash for the replacement value since that would just serve to escape the quote that follows it. I've also tried using single (as opposed to double) quotes around the replacement value but still get two backslashes in the result.
EDIT: Thanks to the commenters I now realize my confusion was with how the Rails console reports the result of the operation (i.e. a\\b). Although the strings 'a\b' and 'a\\b' appear to be different, they both have the same length:
'a\b'.length (3)
'a\\b'.length (3)
You can represent a single backslash by either "\\" or '\\'. Try this in irb, where
"\\".size
correctly outputs 1, showing that you indeed have only one character in this string, not 2 as you think. You can also do a
puts "\\"
Similarily, your example
puts("a\b".gsub("\", "\\"))
correctly prints
a\b

How to use escaped colon in thymeleaf th:text?

This is what happens when I try to use a colon in th:text:
and a backslash doesn't seem to fix it:
How can I use the colon symbol in th:text?
If you want to place a literal into th:text, you have to use single quotes: th:text="'7:00AM'". See documentation here.
(By contrast, something like this th:text="7_00AM" is valid - because it is a literal token. Such strings can only use a subset of characters, but do not need enclosing 's.)

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
#hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
This should work:
#hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
#hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
How about this:
#hashtags ||=string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
Takes cares of #alphabets or #2323+2323 #2323-2323 #2323+65656-67676
Also removes # at beginning
Or if you want it in array form:
#hashtags ||=string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
Wow, this took so long but I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but not scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) in my computer. The difference being the () on the 2nd scan statement. This has no effect as it should be when I use with match method.

Regex to validate strings having only characters (without special characters but with accented characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can contain only characters (case insensitive characters), blank spaces and numbers.
More:
special characters are not allowed (eg: !"£$%&/()=?^) except - and _;
accented characters are allowed (eg: à, è, é, ò, ...);
The regex that I know from this question is ^[a-zA-Z\d\s]*$ but this do not validate special characters and accented characters.
So, how I should improve the regex?
I wrote the ^(?:[^\W_]|\s)*$ answer in the question you referred to (which actually would have been different if I'd known you wanted to allow _ and -). Not being a Ruby guy myself, I didn't realize that Ruby defaults to not using Unicode for regex matching.
Sorry for my lack of Ruby experience. What you want to do is use the u flag. That switches to Unicode (UTF-8), so accented characters are caught. Here's the pattern you want:
^[\w\s-]*$
And here it is in action at Rubular. This should do the trick, I think.
The u flag works on my original answer as well, though that one isn't meant to allow _ or - characters.
Something like ^[\w\s\-]*$ should validate characters, blank spaces, minus, and underscore.
Validation string only for not allowed characters. In this case |,<,>," and &.
^[^|<>\"&]*$

Backslash + captured group within Ruby regular expression

How do I excape a backslash before a captured group?
Example:
"foo+bar".gsub(/(\+)/, '\\\1')
What I expect (and want):
foo\+bar
what I unfortunately get:
foo\\1bar
How do I escape here correctly?
As others have said, you need to escape everything in that string twice. So in your case the solution is to use '\\\\\1' or '\\\\\\1'. But since you asked why, I'll try to explain that part.
The reason is that replacement sequence is being parsed twice--once by Ruby and once by the underlying regular expression engine, for whom \1 is its own escape sequence. (It's probably easier to understand with double-quoted strings, since single quotes introduce an ambiguity where '\\1' and '\1' are equivalent but '\' and '\\' are not.)
So for example, a simple replacement here with a captured group and a double quoted string would be:
"foo+bar".gsub(/(\+)/, "\\1") #=> "foo+bar"
This passes the string \1 to the regexp engine, which it understands as a reference to a capture group. In Ruby string literals, "\1" means something else entirely (ASCII character 1).
What we actually want in this case is for the regexp engine to receive \\\1. It also understands \ as an escape character, so \\1 is not sufficient and will simply evaluate to the literal output \1. So, we need \\\1 in the regexp engine, but to get to that point we need to also make it past Ruby's string literal parser.
To do that, we take our desired regexp input and double every backslash again to get through Ruby's string literal parser. \\\1 therefore requires "\\\\\\1". In the case of single quotes one slash can be omitted as \1 is not a valid escape sequence in single quotes and is treated literally.
Addendum
One of the reasons this problem is usually hidden is thanks to the use of /.+/ style regexp quotes, which Ruby treats in a special way to avoid the need to double escape everything. (Of course, this doesn't apply to gsub replacement strings.) But you can still see it in action if you use a string literal instead of a regexp literal in Regexp.new:
Regexp.new("\.").match("a") #=> #<MatchData "a">
Regexp.new("\\.").match("a") #=> nil
As you can see, we had to double-escape the . for it to be understood as a literal . by the regexp engine, since "." and "\." both evaluate to . in double-quoted strings, but we need the engine itself to receive \..
This happens due to a double string escaping. You should use 5 slashes in this case.
"foo+bar".gsub(/([+])/, '\\\\\1')
Adding \ two more times escapes this properly.
irb(main):011:0> puts "foo+bar".gsub(/(\+)/, '\\\\\1')
foo\+bar
=> nil

Resources