How to use double brackets in a regular expression? - ruby

What do double square brackets mean in a regex? I am confused about the following examples:
/[[^abc]]/
/[^abc]/
I was testing using Rubular, but I didn't see any difference between the one with double brackets and single brackets.

POSIX character classes use a [:alpha:] notation and are written inside an ordinary character class, like this:
/[[:alpha:][:digit:]]/
You'll need to scroll down a ways to get to the POSIX information in the link above. From the docs:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set
# U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
/[[:digit:]]/.match("\u06F2") #=> #<MatchData "\u{06F2}">
/[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
/[[:xdigit:]][[:xdigit:]]/.match("A6") #=> #<MatchData "A6">
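To tie this back to the double brackets: the outer pair is the ordinary character class and each [:name:] inside it is a POSIX class, so several of them can share one outer pair, and a caret after the outer bracket negates the lot. A small sketch (the sample string and constant names are made up for illustration):

```ruby
# Hypothetical sample input; "\u00FC" is "ü".
SAMPLE = "abc123! \u00FC9"

# One outer class holding two POSIX classes: runs of letters or digits.
ALNUM_RUNS = SAMPLE.scan(/[[:alpha:][:digit:]]+/)

# A caret right after the outer bracket negates the whole class.
NON_ALPHA = SAMPLE.scan(/[^[:alpha:]]/)

p ALNUM_RUNS #=> ["abc123", "ü9"]
p NON_ALPHA  #=> ["1", "2", "3", "!", " ", "9"]
```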

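'[[' doesn't have any special meaning of its own. [xyz] is a character class and will match a single x, y or z. A caret ^ right after the opening bracket negates the class, so it matches any character not listed in the brackets.
Removing ^ for simplicity, you can see that the first open bracket is being matched with the first close bracket, and the second open bracket is being used as part of the character class. The final close bracket is treated as another character to be matched. (That is how an engine without nested character classes, such as Ruby 1.8's, reads it; Ruby 1.9 and later parse the inner brackets as a nested class, so there /[[abc]]/ behaves like /[abc]/.)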
irb(main):032:0> /[[abc]]/ =~ "[a]"
=> 1
irb(main):033:0> /[[abc]]/ =~ "a]"
=> 0
This appears to have the same result as your original in some cases
irb(main):034:0> /[abc]/ =~ "a]"
=> 0
irb(main):034:0> /[abc]/ =~ "a"
=> 0
But this is only because your regular expression is not looking for an exact match.
irb(main):036:0> /^[abc]$/ =~ "a]"
=> nil
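The strings tested above each contain both a class member and a ']', so they cannot tell a literal-bracket reading apart from a nested-class reading. A minimal check with inputs that do distinguish the two, assuming Ruby 1.9 or later (where, as far as I know, the inner brackets form a nested character class); the constant names are only for illustration:

```ruby
# "a" has no ']' after it, so it can only match if the inner
# brackets are a nested class rather than "class, then literal ]".
NESTED_A = (/[[abc]]/ =~ "a")

# "[" can only match if '[' is a literal member of the class.
NESTED_BRACKET = (/[[abc]]/ =~ "[")

# Compare the two patterns from the question on a few inputs.
NEGATED_SAME = ["a", "x", "]", "["].all? do |s|
  (/[[^abc]]/ =~ s) == (/[^abc]/ =~ s)
end

p NESTED_A       #=> 0
p NESTED_BRACKET #=> nil
p NEGATED_SAME   #=> true
```

If this holds on your Ruby, it would explain why Rubular shows no difference between the two patterns in the question.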

Related

How to get formatted data from an array and convert it back to a new array in Ruby

I have a data array like below. I need to format it like shown
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Need to format like
["EC006", "ED009", "AX009"]
arr = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
To merely extract the strings of interest, assuming the data is formatted correctly, we may write the following.
arr.map { |s| s[/(?<=\[)[^\]]*/] }
#=> ["EC006", "ED009", "AX009"]
See String#[] and Demo
In the regular expression (?<=\[) is a positive lookbehind that asserts the previous character is '['. The ^ at the beginning of the character class [^\]] means that any character other than ']' must be matched. Appending the asterisk ([^\]]*) causes the character class to be matched zero or more times.
Alternatively, we could use the regular expression
/\[\K[^\]]*/
where \K causes the beginning of the match to be reset to the current string location and all previously-matched characters to be discarded from the match that is returned.
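A quick side-by-side of the two patterns, using the array from the question (the constant names are only for illustration), suggests they return the same strings:

```ruby
ARR = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]

# Lookbehind: '[' must precede the match but is never part of it.
WITH_LOOKBEHIND = ARR.map { |s| s[/(?<=\[)[^\]]*/] }

# \K: '[' is matched first, then dropped from the reported match.
WITH_KEEP = ARR.map { |s| s[/\[\K[^\]]*/] }

p WITH_LOOKBEHIND #=> ["EC006", "ED009", "AX009"]
p WITH_KEEP       #=> ["EC006", "ED009", "AX009"]
```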
To confirm the correctness of the formatting as well, use
arr.map { |s| s[/\A[1-9]\d{3} \[\K[A-Z]{2}\d{3}(?=]\z)/] }
#=> ["EC006", "ED009", "AX009"]
Demo
Note that at the link I replaced \A and \z with ^ and $, respectively, in order to test the regex against multiple strings.
This regular expression can be broken down as follows.
\A # match beginning of string
[1-9] # match a digit other than zero
\d{3} # match 3 digits
[ ] # match one space
\[ # match '['
\K # reset start of match to current string location and discard
# all characters previously matched from match that is returned
[A-Z]{2} # match 2 uppercase letters
\d{3} # match 3 digits
(?=]\z) # positive lookahead asserts following character is
# ']' and that character is at the end of the string
In the above I placed a space character in a character class ([ ]) merely to make it visible to the reader.
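To see the validation at work, here is a small sketch that runs the same pattern over some deliberately malformed strings (the extra inputs are invented for illustration, and I have escaped the ']' in the lookahead, which should not change the meaning):

```ruby
CODE_RE = /\A[1-9]\d{3} \[\K[A-Z]{2}\d{3}(?=\]\z)/

# Only the first entry follows the expected format.
MIXED = [
  "8619 [EC006]",   # well formed
  "0619 [EC006]",   # leading zero
  "9876 [ed009]",   # lowercase letters
  "1034 [AX009] "   # trailing space after ']'
]

# String#[] returns nil when the pattern does not match.
RESULTS = MIXED.map { |s| s[CODE_RE] }

p RESULTS         #=> ["EC006", nil, nil, nil]
p RESULTS.compact #=> ["EC006"]
```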
Input
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Code
p a.collect { |x| x[/\[(.*)\]/, 1] }
Output
["EC006", "ED009", "AX009"]
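One thing to watch with .* here: it is greedy, so if a string ever contained a second bracketed part, the capture would run to the last ']'. A sketch with an invented input showing the difference from a negated class:

```ruby
# Hypothetical input with two bracketed parts.
TWO_PARTS = "8619 [EC006] [note]"

# Greedy .* extends to the last ']' in the string.
GREEDY = TWO_PARTS[/\[(.*)\]/, 1]

# [^\]]* cannot cross a ']', so it stops at the first one.
NEGATED = TWO_PARTS[/\[([^\]]*)\]/, 1]

p GREEDY  #=> "EC006] [note"
p NEGATED #=> "EC006"
```

For the data in the question, which has one bracketed code per string, both forms give the same result.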

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to allow any _ or - at the beginning of the code.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having trouble adding - matches to the same regex. I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The fact that
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
does not match the second hyphen in "eslint-global-path" is because of the anchor ^ which limits the match to be on the first hyphen only. This regex reads, "Match the beginning of the line, not followed by a hyphen or underscore, then match one or more word characters (including underscores), a hyphen or underscore, and then one or more word characters in a capture group. Lastly, assert that the line does not end with a hyphen or underscore."
The fact that an underscore (but not a hyphen) is a word (\w) character completely messes up the regex: (\w+) can absorb the earlier underscores in nerd_godess_of_skyrim, but it cannot cross a hyphen, so a string containing two hyphens can never match between ^ and $. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
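The difference is easy to reproduce with the inputs from the question (a minimal sketch; the constant names are mine):

```ruby
ORIGINAL_RE = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# \w+ swallows the earlier underscores, so this still matches.
UNDERSCORES = "nerd_godess_of_skyrim".match(ORIGINAL_RE)

# One hyphen fits the pattern: word, separator, word.
ONE_HYPHEN = "eslint-path".match(ORIGINAL_RE)

# \w+ cannot cross a hyphen, so two hyphens leave no way to reach $.
TWO_HYPHENS = "eslint-global-path".match(ORIGINAL_RE)

p UNDERSCORES.captures #=> ["nerd_godess_of", "skyrim"]
p ONE_HYPHEN.captures  #=> ["eslint", "path"]
p TWO_HYPHENS          #=> nil
```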
Try this.
r = /
(?<= # begin a positive lookbehind
[^_-] # match a character other than an underscore or hyphen
) # end positive lookbehind
( # begin capture group 1
(?: # begin a non-capture group
-+ # match one or more hyphens
| # or
_+ # match one or more underscores
) # end non-capture group
[^_-] # match any character other than an underscore or hyphen
) # end capture group 1
/x # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen.
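Wrapped up as a method, the gsub approach might look like the sketch below. The name camelize is hypothetical, and I have moved the capture group onto the trailing character so the block can upcase it directly; otherwise it is the same regex.

```ruby
# Hypothetical helper: upcases the character that follows a run of
# hyphens or underscores, but only when that run sits between two
# characters that are neither '_' nor '-'.
def camelize(str)
  str.gsub(/(?<=[^_-])(?:-+|_+)([^_-])/) { Regexp.last_match(1).upcase }
end

p camelize("pro_gamer")                #=> "proGamer"
p camelize("eslint-global-path")       #=> "eslintGlobalPath"
p camelize("__proto__")                #=> "__proto__"
p camelize("_cats_have--nine_lives--") #=> "_catsHaveNineLives--"
```

No recursion is needed here, since gsub replaces every separator in a single pass.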
You can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
see the demo here.
Switch _- to -_ so that - is not treated as a range op, as in a-z.
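As far as I know, a hyphen that comes first or last inside a character class is taken literally either way, so the reordering is mostly a readability choice. A quick check, under that assumption (constant names are mine):

```ruby
TRAILING = /[_-]/ # hyphen last
LEADING  = /[-_]/ # hyphen first

# Both classes should accept '-' and '_' and reject other characters.
SAME = ["-", "_", "a", "5"].all? { |s| (TRAILING =~ s) == (LEADING =~ s) }

p TRAILING =~ "-" #=> 0
p LEADING =~ "-"  #=> 0
p SAME            #=> true
```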

What is the difference between \d and \p{Digit}?

While I have been using \p{Alpha} and \p{Space} for quite some time in my regular expressions I just came across \p{Digit}, but I couldn't find any information about what the up- or downsides are compared to the normal \d that I normally use. What are the key differences between those two?
\d matches only ASCII digits, i.e. it is equivalent to the class [0-9]. \p{Digit} matches the same characters as \d plus any other Unicode character that represents a digit. For example, to match the Arabic-Indic digit zero (code point U+0660):
"\u0660"
# => "٠"
"\u0660" =~ /\d/
# => nil
"\u0660" =~ /\p{Digit}/
# => 0
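The same difference shows up when scanning mixed text. A small sketch with an invented string that contains both ASCII and Arabic-Indic digits:

```ruby
# "\u0664\u0662" is "٤٢" (Arabic-Indic digits four and two).
MIXED_DIGITS = "42 and \u0664\u0662"

ASCII_ONLY = MIXED_DIGITS.scan(/\d+/)
UNICODE    = MIXED_DIGITS.scan(/\p{Digit}+/)
POSIX      = MIXED_DIGITS.scan(/[[:digit:]]+/)

p ASCII_ONLY #=> ["42"]
p UNICODE    #=> ["42", "٤٢"]
p POSIX      #=> ["42", "٤٢"]
```

So \d is the safer choice when the result is going to be parsed as an ordinary number, and \p{Digit} when any script's digits should count.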

Retain Umlaut Character when using Split in Ruby

Why does this code (that contains an umlaut):
text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
puts w
end
Return this result (that does not retain the previously-given umlaut):
=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer
Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?
EDIT: I use ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]
\W just matches non-word characters, i.e., it's equivalent to [^a-zA-Z0-9_], so special characters and diacritics such as ü count as non-word characters and the string is split on them. You can use
words = text.split(/[^[:word:]]/)
which splits on anything that is not a Unicode "word" character, or
words = text.split(/[^\p{Latin}]/)
which splits on anything that is not in the Unicode Latin script.
Note that both of these treat accented letters from other languages as word characters too, not just German ones.
See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties."
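A side-by-side sketch with the string from the question. I have added a + to the Unicode-aware pattern so that runs of separators do not produce empty strings; that is my addition, not part of the suggestion above.

```ruby
# "\u00FC" is "ü".
TEXT = "Some super text with a german umlaut Wirtschaftspr\u00FCfer"

# ASCII-only \W treats ü as a separator and cuts the word in two.
ASCII_SPLIT = TEXT.split(/\W+/)

# [[:word:]] is Unicode-aware, so ü stays inside the word.
WORD_SPLIT = TEXT.split(/[^[:word:]]+/)

p ASCII_SPLIT.last(2) #=> ["Wirtschaftspr", "fer"]
p WORD_SPLIT.last     #=> "Wirtschaftsprüfer"
```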
You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, newlines).
Why does this code [...] not retain the previously-given umlaut
Because \W matches a non-word ASCII character (i.e. not a-z, not A-Z, not 0-9 and not _) and ü is such a character.
Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?
Sure, you can for example split by whitespace, which is the default if no pattern is given:
"Müllmann Straßenverkehr Wirtschaftsprüfer".split
=> ["Müllmann", "Straßenverkehr", "Wirtschaftsprüfer"]
From Ruby doc:
/\W/ - A non-word character ([^a-zA-Z0-9_])
ü isn't a word character, so \W matches and splits there. \p{Lu} and \p{Ll} are Ruby shorthands for Unicode uppercase and lowercase characters, so you could do:
text.split /[^\p{Ll}\p{Lu}]/
... and should split even the most exotic strings.
Because you used /\W/ to split the text, which matches anything not in this list:
a-zA-Z0-9_
Try splitting on
[^\w\ü]
which is
^ not in
\w a-zA-Z0-9_
\ü
(Alternatively, look at creating your own pattern which you can reuse.)
http://ruby-doc.org/core-1.9.3/Regexp.html ref

Regexp using literal vs. escaped character

Is there any practical difference between a regexp using an escape character versus one using the literal character? I.e. are there any situations where matching with them will return different results?
Example in Ruby:
literal = Regexp.new("\t")
=> / /
escaped = Regexp.new("\\t")
=> /\t/
# They're different...
literal == escaped
=> false
# ...but they seem to match the same:
"Hello\tWorld".match(literal)
=> #<MatchData "\t">
"Hello\tWorld".match(escaped)
=> #<MatchData "\t">
No, not in the case of \t (or \n).
But it won't work in most other cases (e.g., escape sequences that either don't have a 1:1 equivalent in string escapes like \s or where the meaning differs like \b), so it's generally a good idea to use the escaped versions (or construct the regex using /.../ in the first place).
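To make the \b case concrete: in a double-quoted Ruby string "\b" is a backspace character, while "\\b" reaches the regex engine as the word-boundary escape. A minimal sketch (constant names are mine):

```ruby
LITERAL_B = Regexp.new("\b")  # pattern is one backspace character (0x08)
ESCAPED_B = Regexp.new("\\b") # pattern is the regex escape \b (word boundary)

p "foo bar" =~ ESCAPED_B   #=> 0   (boundary before "foo")
p "foo bar" =~ LITERAL_B   #=> nil (no backspace in the string)
p "foo\bbar" =~ LITERAL_B  #=> 3   (the backspace itself)
```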

Resources