Ruby's string: Escape and unescape a custom character

Suppose I consider the £ character dangerous, and I want to be able to protect any string and to unprotect it again.
Example 1:
"Foobar £ foobar foobar foobar." # => dangerous string
"Foobar \£ foobar foobar foobar." # => protected string
Example 2:
"Foobar £ foobar £££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \£\£\£\£\£\£\£foobar foobar." # => protected string
Example 3:
"Foobar \£ foobar \\£££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \\\£\£\£\£\£\£\£foobar foobar." # => protected string
Is there an easy way, with Ruby, to escape (and unescape) a given character (such as £ in my example) from a string?
Edit: here is an explanation of the behavior behind this question.
First of all, thanks for your answers. I have a Rails app with a Tweet model having a content field. Example of tweet:
tweet = Tweet.create(content: "Hello #bob")
Inside the model, there's a serialization process that converts the string like this:
dump('Hello #bob') # => '["Hello £", 42]'
# ... where 42 is the id of the username bob
Then, I'm able to deserialize and display the tweet like this:
load('["Hello £", 42]') # => 'Hello #bob'
In the same way, it's also possible to do so with more than one username:
dump('Hello #bob and #joe!') # => '["Hello £ and £!", 42, 185]'
load('["Hello £ and £!", 42, 185]') # => 'Hello #bob and #joe!'
That's the goal :)
But this find-and-replace could be hard to perform with something like:
tweet = Tweet.create(content: "£ Hello #bob")
because here we also have to escape the £ character. I think your solution is good for this, so the result becomes:
dump('£ Hello #bob') # => '["\£ Hello £", 42]'
load('["\£ Hello £", 42]') # => '£ Hello #bob'
Just perfect. <3 <3
Now, if there is this:
tweet = Tweet.create(content: "\£ Hello #bob")
I think we first should escape every \, and then escape every £, like:
dump('\£ Hello #bob') # => '["\\£ Hello £", 42]'
load('["\\£ Hello £", 42]') # => '£ Hello #bob'
However... what can we do in this case:
tweet = Tweet.create(content: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\£ Hello #bob")
...where tweet.content.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\") does not seem to work.
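For what it's worth, the "escape every \ first, then every £" idea can be sketched roughly as follows. This is only a sketch under assumptions: USER_IDS is a made-up lookup table standing in for the real user records, dump_tweet and load_tweet are hypothetical names, and the result is a plain Ruby array rather than the serialized string shown above.

```ruby
# Hypothetical lookup table standing in for the real user records.
USER_IDS = { "bob" => 42, "joe" => 185 }.freeze

# Replace each #username with a £ placeholder and collect the ids.
# Literal backslashes and £ signs are escaped first, in a single pass.
def dump_tweet(content)
  ids = []
  escaped  = content.gsub(/[\\£]/) { |char| "\\#{char}" }
  template = escaped.gsub(/#(\w+)/) do
    ids << USER_IDS.fetch(Regexp.last_match(1)) # raises KeyError for unknown names
    "£"
  end
  [template, *ids]
end

# Walk the template left to right: an escaped character is restored as-is,
# a bare £ is replaced by the next username.
def load_tweet(template, *ids)
  remaining = ids.dup
  template.gsub(/\\(.)|£/m) do
    escaped_char = Regexp.last_match(1)
    escaped_char || "##{USER_IDS.key(remaining.shift)}"
  end
end
```

Because the backslash itself is escaped too, a £ preceded by any number of backslashes should survive the round trip.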

Hopefully your version of Ruby supports lookbehinds. If it doesn't, my solution will not work for you.
Escape characters:
str = str.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\")
Unescape characters:
str = str.gsub(/(?<!\\)((?:\\\\)*)\\£/, '\1£') # single quotes, so that \1 stays a backreference
Both regexes work regardless of the number of backslashes, and they complement each other.
Escape explanation:
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
£ # Match the character “£” literally
)
"
Note that I am matching a certain position. No text is consumed at all. When I pinpoint the position I want, I insert a \.
Explanation of unescape:
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
( # Match the regular expression below and capture its match into backreference number 1
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
\\ # Match the character “\” literally
£ # Match the character “£” literally
"
Here I am saving all the backslashes minus one, and I replace the whole match with that number of backslashes followed by the special character. Tricky stuff :)
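A quick way to check the two regexes against the examples from the question is sketched below. The names protect and unprotect are hypothetical wrappers, not part of any library. The round trip only holds when every £ in the input is preceded by an even number of backslashes (zero included); an input that already contains \£ is left alone by the escape step.

```ruby
# Hypothetical helper names; the regexes are the two described above.
ESCAPE_RE   = /(?<!\\)(?=(?:\\\\)*£)/
UNESCAPE_RE = /(?<!\\)((?:\\\\)*)\\£/

def protect(str)
  str.gsub(ESCAPE_RE, "\\")      # insert one backslash at each matched position
end

def unprotect(str)
  str.gsub(UNESCAPE_RE, '\1£')   # keep the captured backslash pairs, drop one backslash
end

puts protect('Foobar £ foobar £££foobar.')  # prints: Foobar \£ foobar \£\£\£foobar.
puts unprotect('Foobar \£ foobar.')         # prints: Foobar £ foobar.
```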

If you are using Ruby 1.9, which has lookbehind, then FailedDev's answer should work quite well. If you are using Ruby 1.8, which does not have lookbehind (I think), a different approach may work. Give this a try:
text.gsub!(/(\\.)|£/m) do
  if ($1 != nil) # If escaped anything
    $1 # replace with self.
  else # Otherwise escape the
    "\\£" # unescaped £.
  end
end
Note that I am not a Ruby programmer and this snippet is untested (in particular I'm not sure if the: if ($1 != nil) statement usage is correct - it may need to be: if ($1 != "") or if ($1)), but I do know that this general technique (using code in place of a simple replacement string) works. I recently used this same technique for my JavaScript solution to a similar question which was looking to find unescaped asterisks.

I'm not sure if this is what you want, but I think you can do a simple find-and-replace:
str = str.gsub("£", "\\£") # to escape
str = str.gsub("\\£", "£") # to unescape
Note that I changed \ to \\ because you have to escape the backslash in a double-quoted string.
Edit: I think what you want is a regex that matches an odd number of backslashes:
str = str.gsub(/(^|[^\\])((?:\\\\)*)\\£/, "\\1\\2£")
That performs the following transformations:
"£" #=> "£"
"\\£" #=> "£"
"\\\\£" #=> "\\\\£"
"\\\\\\£" #=> "\\\\£"

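One caveat with the consuming group (^|[^\\]), as far as I can tell: when two escaped characters sit right next to each other, the first match consumes the £ that the second match would need as its "not a backslash" character. The sketch below compares it with a zero-width lookbehind guard on the same input. The constant names are made up for illustration.

```ruby
CONSUMING  = /(^|[^\\])((?:\\\\)*)\\£/   # consumes the character before the backslashes
LOOKBEHIND = /(?<!\\)((?:\\\\)*)\\£/     # only looks at it, consumes nothing

ADJACENT = '\£\£'                        # two escaped pound signs back to back

puts ADJACENT.gsub(CONSUMING, '\1\2£')   # prints: £\£
puts ADJACENT.gsub(LOOKBEHIND, '\1£')    # prints: ££
```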
Related

splitting a string misses the word which is used to split it

I have a string
a="Tamilnadu is far away from Kashmir"
If I split this string using "Tamilnadu", I don't find Tamilnadu in the resulting array; I find an empty string there instead. If I split the string on "away", then away is not present in the resulting array either. What should I do to keep the word I split on in the result?
Example
a="Tamilnadu is far away from Kashmir"
p a.split("Tamilnadu")
then Output is
["", " is far away from Kashmir"]
But I want
["Tamilnadu", " is far away from Kashmir"]
From docs:
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
So... to split by "Tamilnadu" and keep it in the list, make it a capture group:
"Tamilnadu is far away from Kashmir".split(/(Tamilnadu)/)
# => ["", "Tamilnadu", " is far away from Kashmir"]
or, if you want to split after "Tamilnadu", make a zero-width match after it using lookbehind:
"Tamilnadu is far away from Kashmir".split(/(?<=Tamilnadu)/)
# => ["Tamilnadu", " is far away from Kashmir"]
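To use the capture-group idea with an arbitrary word and drop the empty string, something like the following sketch may help. split_keeping is a hypothetical helper name; Regexp.escape guards against words that contain regex metacharacters. When the word occurs only once, String#partition gives a similar three-part result without a regex.

```ruby
# Hypothetical helper: split `str` around every occurrence of `word`,
# keeping the word itself and dropping empty pieces.
def split_keeping(str, word)
  str.split(/(#{Regexp.escape(word)})/).reject(&:empty?)
end

a = "Tamilnadu is far away from Kashmir"
p split_keeping(a, "Tamilnadu")  # => ["Tamilnadu", " is far away from Kashmir"]
p split_keeping(a, "away")       # => ["Tamilnadu is far ", "away", " from Kashmir"]
p a.partition("away")            # => ["Tamilnadu is far ", "away", " from Kashmir"]
```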
If you don't know where "Tamilnadu" is in the string but you want to split the string before and after it, and not have any empty strings in the resulting array, you can use String#scan:
def split_it(str, substring)
str.scan(/\A.+(?= #{substring}\b)|\b#{substring}\b|(?<=\b#{substring} ).+/)
end
substring = "Tamilnadu"
split_it("Tamilnadu is far away from Kashmir", substring)
#=> ["Tamilnadu", "is far away from Kashmir"]
split_it("Far away is Tamilnadu from Kashmir", substring)
#=> ["Far away is", "Tamilnadu", "from Kashmir"]
split_it("Far away from Kashmir is Tamilnadu", substring)
#=> ["Far away from Kashmir is", "Tamilnadu"]
split_it("Far away is Daluth from Kashmir", substring)
#=> []
split_it("Far away is Tamilnaduland from Kashmir", substring)
#=> []
I've assumed that substring appears at most once in the string.
The regular expression can be written in free-spacing mode to make it self-documenting:
substring = "Tamilnadu"
/
\A.+ # match the beginning of the string followed by > 0 characters
(?=\ #{substring}\b) # match the value of substring preceded by a space and
# followed by a word break, in a positive lookahead
| # or
\b#{substring}\b # match the value of substring with a word break before and after
| # or
(?<=\b#{substring}\ ) # match the value of substring preceded by a word break
# and followed by a space, in a positive lookbehind
.+ # match > 0 characters
/x # free-spacing regex definition mode
#=>
/
\A.+ # ...
(?=\ Tamilnadu\b) # ...
| # ...
\bTamilnadu\b # ...
| # ...
(?<=\bTamilnadu\ ) # ...
.+ # ...
/x
Free-spacing mode removes all spaces before the regex is parsed, including spaces that may be intended to be part of the expression. It was for that reason that I escaped the two spaces. I could alternatively put each in a character class ([ ]) or use \s, [[:space:]] or \p{Space}, though they match whitespace, which is not quite the same.
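A small sketch of that whitespace behavior, using made-up constant names, in case it helps to see it in isolation:

```ruby
IGNORED = /far away/x    # the bare space is stripped: same as /faraway/
ESCAPED = /far\ away/x   # the escaped space is kept
BRACKET = /far[ ]away/x  # a space inside a character class is kept too

s = "Tamilnadu is far away from Kashmir"
p IGNORED.match?(s)          # => false
p IGNORED.match?("faraway")  # => true
p ESCAPED.match?(s)          # => true
p BRACKET.match?(s)          # => true
```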

How to replace Perl-style regex with MatchData object

I am using the gsub method with a regular expression:
@text.gsub(/(-\n)(\S+)\s/) { "#{$2}\n" }
Example of input data:
"The wolverine is now es-
sentially absent from
the southern end
of its European range."
should return:
"The wolverine is now essentially
absent from
the southern end
of its European range."
The method works fine, but RuboCop reports an offense:
Avoid the use of Perl-style backrefs.
Any ideas how to rewrite it using MatchData object instead of $2?
If you want to use Regexp.last_match:
@text.gsub(/(-\n)(\S+)\s/) { Regexp.last_match[2] + "\n" }
or:
@text.gsub(/-\n(\S+)\s/) { Regexp.last_match[1] + "\n" }
Note that the block in gsub should be used when logic is involved. Without logic, a second parameter set to "\\1\n" or '\1' + "\n" would do just fine.
You can use a backslash reference without the block:
@text.gsub(/(-\n)(\S+)\s/, "\\2\n")
Also, it's a bit cleaner to use only one group, since the first one above isn't needed:
@text.gsub(/-\n(\S+)\s/, "\\1\n")
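Put together as a runnable sketch (the method names and the HYPHEN_BREAK constant are invented here for illustration), the variants can be compared on the sample text from the question. The named-group form is an extra option that avoids numbered references altogether.

```ruby
HYPHEN_BREAK = /-\n(\S+)\s/

# Backreference in the replacement string, no block needed.
def join_with_backref(text)
  text.gsub(HYPHEN_BREAK, "\\1\n")
end

# MatchData accessed through Regexp.last_match inside a block.
def join_with_last_match(text)
  text.gsub(HYPHEN_BREAK) { "#{Regexp.last_match(1)}\n" }
end

# Named group, referenced as \k<rest> in the replacement string.
def join_with_named_group(text)
  text.gsub(/-\n(?<rest>\S+)\s/, "\\k<rest>\n")
end

sample = "The wolverine is now es-\nsentially absent from\nthe southern end\nof its European range."
puts join_with_backref(sample)
```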
This solution accounts for errant spaces before newlines and split words that end a sentence or the string. It uses String#gsub with a block and no capture groups.
Code
R = /
[[:alpha:]]\- # match a letter followed by a hyphen
\s*\n # match a newline possibly preceded by whitespace
[[:alpha:]]+ # match one or more letters
[.?!]? # possibly match a sentence terminator
\n? # possibly match a newline
\s* # match zero or more whitespaces
/x # free-spacing regex definition mode
def remove_hyphens(str)
str.gsub(R) { |s| s.gsub(/[\n\s-]/, '') << "\n" }
end
Examples
str =<<_
The wolverine is now es-
sentially absent from
the south-
ern end of its
European range.
_
puts remove_hyphens(str)
The wolverine is now essentially
absent from
the southern
end of its
European range.
puts remove_hyphens("now es- \nsentially\nabsent")
now essentially
absent
puts remove_hyphens("now es-\nsentially.\nabsent")
now essentially.
absent
remove_hyphens("now es-\nsentially?\n")
#=> "now essentially?\n" (no extra \n at end)

How to remove a certain character after substring in Ruby

I have a string with exclamation marks. I want to remove the exclamation marks at the end of a word, but not the ones before a word. Assume no exclamation mark stands by itself, unaccompanied by a word. By word I mean the letters a to z, which can be uppercase.
For example:
exclamation("Hello world!!!")
#=> ("Hello world")
exclamation("!!Hello !world!")
#=> ("!!Hello !world")
I have read How do I remove substring after a certain character in a string using Ruby? ; these two are close, but different.
def exclamation(s)
s.slice(0..(s.index(/\w/)))
end
# exclamation("Hola!") returns "Hol"
I have also tried s.gsub(/\w!+/, ''). Although it retains the '!' before word, it removes both the last letter and exclamation mark. exclamation("!Hola!!") #=> "!Hol".
How can I remove only the exclamation marks at the end?
If you don't want to use a regex, which is sometimes difficult to understand, use this:
def exclamation(sentence)
words = sentence.split
words_wo_exclams = words.map do |word|
word.split('').reverse.drop_while { |c| c == '!' }.reverse.join
end
words_wo_exclams.join(' ')
end
Although you haven't given a lot of test data, here's an example of something that might work:
def exclamation(string)
  string.gsub(/(\w+)!+(?=\s|\z)/, '\1') # !+ so that a run such as "world!!!" is removed as well
end
The \s|\z part means either a space or the end of the string, and (?=...) means to just peek ahead in the string but not actually match against it.
Note that this won't work for quoted text like "I'm mad!" inside a longer string, where the exclamation mark is followed by a closing quote rather than a space or the end of the string, but you could always add that as another potential end-of-word match.
"!!Hello !world!, world!! I say".gsub(r, '')
#=> "!!Hello !world, world! I say"
where
r = /
(?<=[[:alpha:]]) # match an uppercase or lowercase letter in a positive lookbehind
! # match an exclamation mark
/x # free-spacing regex definition mode
or
r = /
[[:alpha:]] # match an uppercase or lowercase letter
\K # discard match so far
! # match an exclamation mark
/x # free-spacing regex definition mode
If the above example should return "!!Hello !world, world I say", change ! to !+ in the regexes.
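Wrapped into the exclamation method from the question, the !+ variant might look like the sketch below; the expected results are the ones given in the question.

```ruby
# Remove every run of exclamation marks that directly follows a letter.
def exclamation(str)
  str.gsub(/(?<=[[:alpha:]])!+/, '')
end

p exclamation("Hello world!!!")   # => "Hello world"
p exclamation("!!Hello !world!")  # => "!!Hello !world"
p exclamation("!Hola!!")          # => "!Hola"
```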

Extract all words with # symbol from a string

I need to extract all #usernames from a string (for Twitter) using Rails/Ruby:
String Examples:
"#tom #john how are you?"
"how are you #john?"
"#tom hi"
The function should extract all usernames from a string, without the special characters that are disallowed in usernames, such as the "?" in the examples.
From "Why can't I register certain usernames?":
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.
The \w metacharacter is equivalent to [a-zA-Z0-9_]:
/\w/ - A word character ([a-zA-Z0-9_])
Simply scanning for #\w+ will succeed according to that:
strings = [
"#tom #john how are you?",
"how are you #john?",
"#tom hi",
"#foo #_foo #foo_ #foo_bar #f123bar #f_123_bar"
]
strings.map { |s| s.scan(/#\w+/) }
# => [["#tom", "#john"],
# ["#john"],
# ["#tom"],
# ["#foo", "#_foo", "#foo_", "#foo_bar", "#f123bar", "#f_123_bar"]]
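If the names are wanted without the leading symbol, one option is a capture group, since String#scan returns the group contents when the pattern has groups. A minimal sketch, with hypothetical method names:

```ruby
# Usernames including the leading "#".
def usernames_with_symbol(str)
  str.scan(/#\w+/)
end

# Usernames only: scan returns one array per match when the pattern has a
# capture group, so flatten turns [["tom"], ["john"]] into ["tom", "john"].
def usernames(str)
  str.scan(/#(\w+)/).flatten
end

p usernames_with_symbol("#tom #john how are you?")  # => ["#tom", "#john"]
p usernames("#tom #john how are you?")              # => ["tom", "john"]
```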
There are multiple ways to do it - here's one way:
string = "#tom #john how are you?"
words = string.split " "
twitter_handles = words.select do |word|
  word.start_with?('#') && word[1..-1].chars.all? do |char|
    char =~ /[a-zA-Z0-9_]/ # 0-9 rather than 1-9, so that the digit 0 is accepted too
  end && word.length > 1
end
The char =~ regex will only accept alphanumerics and the underscore.
r = /
\# # match the character "#" (escaped, because a bare # starts a comment in free-spacing mode)
[[:alpha:]]+ # match one or more letters
\b # match word break
/x # free-spacing regex definition mode
"#tom #john how are you? And you, #andré?".scan(r)
#=> ["#tom", "#john", "#andré"]
If you wish to instead return
["tom", "john", "andré"]
change the first line of the regex from \# to
(?<=\#)
which is a positive lookbehind. It requires that the character "#" be present but it will not be part of the match.

Ruby regex extracting words

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject like this to avoid empty strings:
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
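To collect the words into an array instead of printing them, a variation along the same lines is sketched below. quoted_words is a hypothetical name, and the pattern is a slightly simplified one that treats any run of non-whitespace as a bare word. Shellwords.split from the standard library applies shell-style quoting rules, which happen to give the same result for this input.

```ruby
require "shellwords"

# Each match yields [quoted, bare]; exactly one of the two is non-nil.
def quoted_words(text)
  text.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p quoted_words(' hello "my name" is "Tom"')      # => ["hello", "my name", "is", "Tom"]
p Shellwords.split(' hello "my name" is "Tom"')  # => ["hello", "my name", "is", "Tom"]
```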
You can try this regex:
/\b(\w+)\b/
which uses \b to find the word boundary. And this web site http://rubular.com/ is helpful.

Resources