ruby gsub new line characters - ruby

I have a string with newline characters that I want to gsub out for white space.
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\\r\\n]/, ' ')
something like this ^ only my regex seems to be replacing the 'r' and 'n' letters as well. the other constraint is sometimes the pattern repeats itself twice and thus would be replaced with two whitespaces in a row, although this is not preferable it is better than all the text being cut apart.
If there is a way to only select the new line characters. Or even better if there a more rubiestic way of approaching this outside of going to regex?

If you have mixed consecutive line breaks that you want to replace with a single space, you may use the following regex solution:
s.gsub(/\R+/, ' ')
See the Ruby demo.
The \R matches any type of line break and + matches one or more occurrences of the quantified subpattern.
Note that in case you have to deal with an older version of Ruby, you will need to use the negated character class [\r\n] that matches either \r or \n:
.gsub(/[\r\n]+/, ' ')
or - add all possible linebreaks:
/gsub(/(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])+/, ' ')

This should work for your test case:
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\r\n]/, ' ')
If you don't want successive \r\n characters to result in duplicate spaces you can use this instead:
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\r\n]+/, ' ')
(Note the addition of the + after the character class.)
As Wiktor mentioned, you're using \\ in your regex, which inside the regex literal /.../ actually escapes a backslash, meaning you're matching a literal backslash \, r, or n as part of your expression. Escaping characters works differently in regex literals, since \ is used so much, it makes no sense to have a special escape for it (as opposed to regular strings, which is a whole different animal).

Related

Ruby Regexp character class with new line, why not match?

I want to use this regex to match any block comment (c-style) in a string.
But why the below does not?
rblockcmt = Regexp.new "/\\*[.\s]*?\\*/" # match block comment
p rblockcmt=~"/* 22/Nov - add fee update */"
==> nil
And in addition to what Sir Swoveland posted, a . matches any character except a newline:
The following metacharacters also behave like character classes:
/./ - Any character except a newline.
https://ruby-doc.org/core-2.3.0/Regexp.html
If you need . to match a newline, you can specify the m flag, e.g. /.*?/m
Options
The end delimiter for a regexp can be followed by one or more
single-letter options which control how the pattern can match.
/pat/i - Ignore case
/pat/m - Treat a newline as a character matched by .
...
https://ruby-doc.org/core-2.3.0/Regexp.html
Because having exceptions/quirks like newline not matching a . can be painful, some people specify the m option for every regex they write.
It appears that you intend [.\s]*? to match any character or a whitespace, zero or more times, lazily. Firstly, whitespaces are characters, so you don't need \s. That simplifies your expression to [.]*?. Secondly, if your intent is to match any character there is no need for a character class, just write .. Thirdly, and most importantly, a period within a character class is simply the character ".".
You want .*? (or [^*]*).

Ruby - convert single backslash unicode to double backslash

I have a string that looks something like '\\u001E\\u001Csome_text\u001F' - where the first two characters are escaped with two backslashes and the last one only has one backslash.
I want to convert that string so that all the unicode literals have two backslashes in them, so the output would look like '\\u001E\\u001Csome_text\\u001F'. What ways could I go about doing this?
If the goal is to find instances of '\u' not preceded by a '\' character, then a negative lookbehind should suit the need to match:
(?<!\\)\\u
If you want to match the whole character, then you can also use a positive lookahead to verify the following characters:
(?<!\\)\\u(?=[0-9A-Z]{4})
Both of these will return the instances of '\u', which you can replace with '\\u'

How come you can't gsub this string in Ruby?

These \\n are showing up in my strings even though it should only be \n.
But if I do this :
"\n".gsub('\\n','\b')
It returns :
"\n"
Ideally, I'm trying to find a regex that could rewrite this string :
"R3pQvDqmz/EQ7zho2mhIeE6UB4dLa6GUH7173VEMdGCcdsRm5pernkqCgbnj\\nZjTX\\n"
To not display two backslashes, but just one like this :
"R3pQvDqmz/EQ7zho2mhIeE6UB4dLa6GUH7173VEMdGCcdsRm5pernkqCgbnj\nZjTX\n"
But any of the regex I do will not work. I can gsub out the \n and put something like X there, but if I put a \ in it, then Ruby escapes it with an additional \ which consequentially destroys my encryption module as it needs to be specific.
Any ideas?
You are falling into the trap of a different meaning of escapes when used in strings with double quotes vs single quotes. Double-quoted strings allow escape characters to be used. Thus, here "\n" actually is a one-character string containing a single line feed. Compare that to '\n' which is a two-character string containing a literal backslash followed by a character n.
This explains, whey your gsub doesn't match. If you use the following code, it should work:
"\\n".gsub('\n','\b')
For your actual issue, you can use this
string = "R3pQvDqmz/EQ7zho2mhIeE6UB4dLa6GUH7173VEMdGCcdsRm5pernkqCgbnj\\nZjTX\\n"
new_string = string.gsub("\\n", "\n")

How to allow string with letters, numbers, period, hyphen, and underscore?

I am trying to make a regular expression, that allow to create string with the small and big letters + numbers - a-zA-z0-9 and also with the chars: .-_
How do I make such a regex?
The following regex should be what you are looking for (explanation below):
\A[-\w.]*\z
The following character class should match only the characters that you want to allow:
[-a-zA-z0-9_.]
You could shorten this to the following since \w is equivalent to [a-zA-z0-9_]:
[-\w.]
Note that to include a literal - in your character class, it needs to be first character because otherwise it will be interpreted as a range (for example [a-d] is equivalent to [abcd]). The other option is to escape it with a backslash.
Normally . means any character except newlines, and you would need to escape it to match a literal period, but this isn't necessary inside of character classes.
The \A and \z are anchors to the beginning and end of the string, otherwise you would match strings that contain any of the allowed characters, instead of strings that contain only the allowed characters.
The * means zero or more characters, if you want it to require one or more characters change the * to a +.
/\A[\w\-\.]+\z/
\w means alphanumeric (case-insensitive) and "_"
\- means dash
\. means period
\A means beginning (even "stronger" than ^)
\z means end (even "stronger" than $)
for example:
>> 'a-zA-z0-9._' =~ /\A[\w\-\.]+\z/
=> 0 # this means a match
UPDATED thanks phrogz for improvement

Ruby RegEx problem text.gsub[^\W-], '') fails

I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "not a word"
\w means "a word"
The // generate a regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is ruby for "create a new regexp containing \s (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't quote it in forward-slashes, ruby will literally try run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
Looking at the regex, the [] says 'match any character in this group. The group contains \W (which means non-word character) and -, so the regex should match any non-word character, or any hyphen.
As the second thing you pass to gsub is an empty string, it should end up replacing all the non-word characters and hyphens with empty string (thereby stripping them out )
.downcase
Which just converts the string to lower case.
Hope this helps :-)
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says replace anything that's a not word nor a - for nothing.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.

Resources