What is an escape character in Ruby? - ruby

I would like to split lines that contain a [ (an opening square bracket). However, when I type this as /[/, it is treated as a comment.

You need to escape the [ char like /\[/.

I infer that you're using string.split, which can take a regex (the stuff between the / /) as the delimiter on which to split the string into a list.
Regexes use the [ and ] characters in a special way: they delimit a character class, which matches any one of the characters inside.
[abc] => matches a, b, or c
Since you need to match the [ symbol literally, you need to escape it with a \ (backslash).
So, write your split as:
string.split(/\[/)
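As a minimal sketch of the options (the sample string is made up for illustration), the escaped regex, a plain String delimiter, and Regexp.escape all treat the bracket literally:

```ruby
# Sample input invented for illustration.
line = "alpha[beta[gamma"

# Escaped bracket inside a regex literal: matches a literal "[".
by_regex = line.split(/\[/)

# A plain String delimiter is taken literally, so no escaping is needed.
by_string = line.split("[")

# Regexp.escape produces the escaped pattern source for arbitrary text.
escaped = Regexp.escape("[")

p by_regex
p by_string
p escaped
```

Regexp.escape is handy when the delimiter comes from a variable rather than being typed into the pattern by hand.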

Related

Why won't my simple regex pattern match and remove a file extension?

I have a string:
app_copy--28.ipa
The result I want is:
app_copy
The number after -- could be of variable length, so I want to match everything including and after --.
I've tried a few patterns, but none are matching for some reason:
gsub("--\*", "")
gsub("--*", "")
gsub("--*.ipa", "")
gsub("--\[0-9].ipa", "")
What am I missing?
Let's take a look at your test patterns:
"--\*" is actually equivalent to "--*" (in a double-quoted string, \* is an unrecognized escape sequence, so the backslash is simply dropped).
"--*" will match a single - character, followed by zero or more - characters.
"--*.ipa" will match a single - character, followed by zero or more - characters, followed by any single character, followed by a literal ipa.
"--\[0-9].ipa" is actually equivalent to "--[0-9].ipa" (the backslash in \[ is dropped in the same way), which will match a literal --, followed by a single decimal digit, followed by any single character, followed by a literal ipa.
However, none of these patterns would work as you used them, because gsub does not treat a String pattern as a regular expression:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally…
You'd need to convert your pattern to a Regexp (using Regexp.new), or use a regular expression literal.
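A short sketch of the difference, using the file name from the question (the variable names are only for illustration):

```ruby
name = "app_copy--28.ipa"

# In a double-quoted string "\*" is just "*", so both literals are the same String.
same_literal = ("--\*" == "--*")

# A String pattern is matched literally: "--*" only matches the text "--*",
# which does not occur here, so nothing is replaced.
string_pattern = name.gsub("--*", "")

# Regexp.new converts pattern source text into a real regular expression.
from_string = name.gsub(Regexp.new("--.*"), "")

# A regular expression literal does the same without the conversion step.
from_literal = name.gsub(/--.*/, "")

p same_literal
p string_pattern
p from_string
p from_literal
```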
Try this pattern
--.*
This pattern matches a literal --, followed by zero or more of any character.
For example:
"app_copy--28.ipa".gsub(/--.*/, "") # => "app_copy"
Don't use gsub to try to change the string. Simply use a pattern to match the part you want:
"app_copy--28.ipa"[/^(.+?)--/, 1] # => "app_copy"
String's [] takes a lot of different types of parameters. You can pass in a pattern, and the index of the capture that you want, to extract just that part. From the documentation:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, that component of the MatchData is returned instead.
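To illustrate both forms of the capture argument, here is a small sketch; the group name base is an assumption chosen for this example:

```ruby
name = "app_copy--28.ipa"

# Capture selected by index.
by_index = name[/\A(.+?)--/, 1]

# Capture selected by name ("base" is a name made up for this sketch).
by_name = name[/\A(?<base>.+?)--/, "base"]

# When the pattern does not match, the result is nil rather than an error.
no_match = "plain.ipa"[/\A(.+?)--/, 1]

p by_index
p by_name
p no_match
```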
How about this?
str = "app_copy--28.ipa"
str[0...str.index("--")] # search for the full "--" delimiter so a single "-" in the name doesn't cut it short
# => "app_copy"
str = "app_copy--28.ipa"
str.split("--").first
# => "app_copy"

String gsub - Replace characters between two elements, but leave surrounding elements

Suppose I have the following string:
mystring = "start/abc123/end"
How can you replace the abc123 with something else, while leaving the "start/" and "/end" elements intact?
I had the following to match for the pattern, but it replaces the entire string. I was hoping to just have it replace the abc123 with 123abc.
mystring.gsub(/start\/(.*)\/end/,"123abc") #=> "123abc"
Edit: The characters between the start & end elements can be any combination of alphanumeric characters, I changed my example to reflect this.
You can do it using this character class : [^\/] (all that is not a slash) and lookarounds
mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/,"7")
For your example, you could perhaps use:
mystring.gsub(/\/(.*?)\//,"/7/")
This matches the two slashes around the string you're replacing and puts them back in the substitution.
Alternatively, you could capture the pieces of the string you want to keep and interpolate them around your replacement; this turns out to be much more readable than lookaheads/lookbehinds:
irb(main):010:0> mystring.gsub(/(start)\/.*\/(end)/, "\\1/7/\\2")
=> "start/7/end"
\\1 and \\2 here refer to the numbered captures inside of your regular expression.
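The doubled backslashes matter because of string quoting. A brief sketch of the three ways to write the replacement (variable names are illustrative only):

```ruby
mystring = "start/abc123/end"
pattern = /(start)\/.*\/(end)/

# Double-quoted string: the backslash itself has to be escaped.
double_quoted = mystring.gsub(pattern, "\\1/7/\\2")

# Single-quoted string: a single backslash is enough.
single_quoted = mystring.gsub(pattern, '\1/7/\2')

# Pitfall: in double quotes "\1" is an octal escape (the control character
# U+0001), so no backreference reaches gsub at all.
unescaped = mystring.gsub(pattern, "\1/7/\2")

p double_quoted
p single_quoted
p unescaped
```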
The problem is that you're replacing the entire matched string, "start/8/end", with "7". You need to include the matched characters you want to persist:
mystring.gsub(/start\/(.*)\/end/, "start/7/end")
Alternatively, just match the digits:
mystring.gsub(/\d+/, "7")
You can do this by grouping the start and end elements in the regular expression and then referring to these groups in the substitution string (named groups are referenced as \k<name>):
mystring.gsub(/(?<start>start\/).*(?<end>\/end)/, "\\k<start>7\\k<end>")
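For reference, a sketch of two ways to reuse named groups in a replacement: the \k<name> backreference and the block form, which reads the groups from the match data. The group names head and tail are invented for this example.

```ruby
mystring = "start/abc123/end"

# "head" and "tail" are group names chosen for this sketch.
pattern = /(?<head>start\/).*(?<tail>\/end)/

# Named backreferences in the replacement string use \k<name>.
with_named_refs = mystring.gsub(pattern, '\k<head>7\k<tail>')

# The block form exposes the current match as $~, so the groups can be
# read by name and combined with ordinary string interpolation.
with_block = mystring.gsub(pattern) { "#{$~[:head]}7#{$~[:tail]}" }

p with_named_refs
p with_block
```

The block form is useful when the replacement has to be computed rather than spelled out.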

Why do I get the Regexp warning "warning: nested repeat operator ? and * was replaced with '*'"

I have a regular expression for parsing Norwegian street addresses:
STREET_ADDRESS_PATTERN = <<-REGEX
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
REGEX
It worked earlier, and I can't remember if I changed something, and in which case what I changed. In any case, now I'm getting this warning:
warning: nested repeat operator ? and * was replaced with '*'
And the match is returning nil. Can anybody see why I'm getting this warning?
Note: I'm currently using this (fake) address to test the expression: "Storgata 38H, 0273 Oslo".
Let's take a look at something you're doing to the poor regular expression engine:
(?<street_name>[\w\D\. ]+)\s+
The problem is inside the character class: [\w\D\. ]+. The following definitions are from Ruby's Regexp class documentation:
/\w/ - A word character ([a-zA-Z0-9_])
/\D/ - A non-digit character ([^0-9])
You're telling the engine to select:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
_
every character that is NOT 0123456789
. and spaces
In other words, every possible character. You'd do just as well to use:
(?<street_name>.+)
because that's going to be pretty greedy. This Rubular example shows your pattern is allowing the engine to capture everything thrown at it, including almost the entire string Storgata 38H, 0273 Oslo: http://rubular.com/r/nMfcB0cUdu
Also, \. inside [] is the same as [.]: a period has no wildcard meaning inside brackets, so it is already literal there and doesn't need to be escaped.
I'd strongly recommend using Rubular to take a look at each section of your regex, and try matching against several other possible addresses strings, and see if Rubular says the patterns are going to match what you expect. Once you've done that, try putting together the complete pattern. As is, I think your subsections are interacting and masking some problems that will come back to bite you later.
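The points above can be checked directly in irb. This is a rough sketch using the test address from the question; the variable names are only for illustration:

```ruby
address = "Storgata 38H, 0273 Oslo"

# \w already includes digits and \D covers every non-digit, so the union
# matches any character at all.
matches_everything = /\A[\w\D]+\z/.match?(address)

# The greedy class swallows the whole string and only backtracks far enough
# to leave one whitespace character for the following \s+.
street_name = address[/\A(?<street_name>[\w\D. ]+)\s+/, "street_name"]

# Inside brackets an escaped and an unescaped period behave identically.
dot_is_literal = "a.b".scan(/[.]/) == "a.b".scan(/[\.]/)

p matches_everything
p street_name
p dot_is_literal
```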
My hope was that [\w\D] would select all word characters except numbers... Any way to do that?
Ah. Let's dive into the documentation again:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
You want to use the /[[:alpha:]]/ pattern. As displayed it would capture only one character, but it'd be within any of the POSIX set of "letter" characters, which is the range you want:
[4] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]/)
[
[0] "æ",
[1] "ø",
[2] "a",
[3] "n",
[4] "d",
[5] "å"
]
Here's a wee tweak:
[5] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]+/)
[
[0] "æ",
[1] "ø",
[2] "and",
[3] "å"
]
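Putting that together, here is one possible sketch of the address pattern using POSIX bracket expressions. It is an assumption about what the pattern should accept, not a tested parser for Norwegian addresses, and the second and third sample addresses are made up:

```ruby
# [[:blank:]] is used instead of a literal space so the class does not
# depend on how whitespace is treated in extended (x) mode.
ADDRESS = /
  \A
  (?<street_name>[[:alpha:][:blank:].]+)\s+
  (?<house_number>\d+)
  (?<entrance>[A-Z])?\s*,\s*
  (?:
    (?<postal_code>\d{4})\s+
    (?<city>[[:alpha:][:blank:]]+)
  )?
  \z
/x

m = ADDRESS.match("Storgata 38H, 0273 Oslo")
p m[:street_name], m[:house_number], m[:entrance], m[:postal_code], m[:city]

# Invented sample addresses: a non-ASCII letter and a missing entrance.
m2 = ADDRESS.match("Grønlandsleiret 1A, 0190 Oslo")
m3 = ADDRESS.match("Karl Johans gate 22, 0159 Oslo")
p m2[:street_name], m3[:street_name], m3[:entrance]
```

Because digits are no longer part of the street name class, the house number cannot be swallowed by the first group.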
Oh, now I see what I did. I replaced the ' delimiters of the string with <<-REGEX which means that all backslashes in the expression must now be escaped. Changing back to single ticks fixed the issue. After sepp2k's recommendation I further edited the Regex string into a literal:
STREET_ADDRESS_PATTERN = /
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
/xi
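For anyone who does want to keep the heredoc, a small sketch of why it broke: a plain heredoc processes backslash escapes like a double-quoted string, while quoting the terminator keeps them intact. The shortened pattern below is only an illustration.

```ruby
# A plain heredoc processes escapes like a double-quoted string:
# "\d" becomes "d" and "\s" becomes a space.
interpolating = <<-REGEX
\d+\s*
REGEX

# Quoting the terminator makes the heredoc behave like a single-quoted
# string, so the backslashes survive.
raw = <<-'REGEX'
\d+\s*
REGEX

p interpolating
p raw

# The raw text can then be compiled in extended mode as before.
pattern = Regexp.new(raw, Regexp::EXTENDED)
p pattern.match?("38 ")
```

With \s turned into a plain space and extended mode ignoring whitespace, a fragment such as )?\s* would collapse to )?*, which would explain the nested repeat operator warning.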

How to match any quoted strings containing Cyrillic symbols

I need to parse a lot of text files and replace any quoted strings containing Cyrillic symbols. They may contain newlines, non-alphabetic characters and special symbols (for example '$' or an escaped quote).
Can anyone help with regex?
From comments:
for example php code
function hello($word) {
$word2 = "ха-ха!";
echo "Привет, $word $word2\n";
}
hello('Мир');
I need to match "ха-ха!", "Привет, $word $word2\n" and 'Мир'
This should work:
str = 'The cat is under the "таблица"'
regex = /"\p{Cyrillic}+.*?\.?"/ui
str.match(regex){|s| do_stuff_with_each_matching s}
# or...
str.gsub!(regex){|s| method_that_translates_russian s}
Check it out live at http://rubular.com/r/0Mwbfinjvp.
http://www.ruby-doc.org/core-1.9.3/Regexp.html
".*[^a-zA-Z\d]+.*" matches any quoted character sequence containing at least one non-alphanumeric character.
i.e. it matches "aa$bb" and "a1$b1"
It doesn't match "aabb" or a$b.
Hope that this is what you want (Add required escaping).
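Another possible approach, sketched here under the assumption that only simple single- and double-quoted literals need handling: first scan for every quoted literal (allowing escaped quotes and newlines), then keep the ones that contain at least one Cyrillic character. The constant name QUOTED and the extra English-only line are additions for this example.

```ruby
# PHP sample from the question, plus one English-only string added here
# as a negative case.
php = <<~'PHP'
  function hello($word) {
    $word2 = "ха-ха!";
    $plain = "no match here";
    echo "Привет, $word $word2\n";
  }
  hello('Мир');
PHP

# A double- or single-quoted literal: a backslash and the character after
# it are consumed as a pair, and newlines inside the quotes are allowed.
QUOTED = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m

# Keep only the literals that contain at least one Cyrillic character.
cyrillic_literals = php.scan(QUOTED).select { |s| s.match?(/\p{Cyrillic}/) }

puts cyrillic_literals
```

To replace rather than list the literals, the same two patterns could be used with gsub and a block that returns the literal unchanged when it has no Cyrillic characters.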

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To leave the original unaltered and only return the result, drop the exclamation point and use sub. If a pattern can match more than once and you want every match replaced, use gsub! or gsub; without the g, only the first match is replaced (which is what you want here, and for a single-line string the ^ can only match once anyway, at the start).
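A quick sketch of how the four variants differ, including one thing to watch for: sub! returns nil when nothing was replaced. The sample values besides the question's strings are made up.

```ruby
titles = ["h3. My Title Goes Here", "No h3. at the start of this line"]

# sub returns a new string and leaves the receiver alone.
cleaned = titles.map { |s| s.sub(/^h3\. /, "") }

# sub! edits in place and returns nil when nothing was replaced,
# so its return value should not be chained.
changed = "h3. Title".dup.sub!(/^h3\. /, "")
unchanged = "Title".dup.sub!(/^h3\. /, "")

# sub replaces only the first match, gsub replaces every match.
first_only = "a-b-c".sub("-", "+")
every = "a-b-c".gsub("-", "+")

p cleaned
p changed
p unchanged
p first_only
p every
```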
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match at the start of the string (strictly, the start of a line; \A anchors to the start of the whole string).
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide for details.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub(/^h3\. /, '') #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means the beginning of a line (here, the beginning of the string)
h3 is matched literally, so it means h3
\. - a dot normally means any character, so we escape it with a backslash
a space is matched literally
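Two related details, sketched below with made-up multi-line input: in Ruby ^ matches at the start of every line, whereas \A matches only at the start of the string, and for a fixed prefix String#delete_prefix (Ruby 2.5 and later) avoids the regex entirely.

```ruby
title = "h3. My Title Goes Here"

# delete_prefix removes the prefix only when the string starts with it.
with_prefix_removed = title.delete_prefix("h3. ")
untouched = "My Title".delete_prefix("h3. ")

# Multi-line input invented to show the anchor difference.
multi = "intro\nh3. Second line"
caret = multi.sub(/^h3\. /, "")    # ^ matches at the start of any line
anchor = multi.sub(/\Ah3\. /, "")  # \A matches only at the start of the string

p with_prefix_removed
p untouched
p caret
p anchor
```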
