Retain Umlaut Character when using Split in Ruby - ruby

Why does this code (that contains an umlaut):
text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
puts w
end
return this result (which does not retain the umlaut):
=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer
Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?
EDIT: I use ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]

\W matches any character that is not an ASCII word character, i.e., it's equivalent to [^a-zA-Z0-9_]. Letters with diacritics such as ü are therefore treated as non-word characters, and the string is split on them. You can use
words = text.split(/[^[:word:]]/)
which splits on anything that is not a Unicode "word" character, or
words = text.split(/[^\p{Latin}]/)
which splits on anything that is not a character in the Unicode Latin script.
Note that both of these will keep special characters from other languages, not just German.
See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties."
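To make the difference concrete, here is a minimal sketch comparing the three patterns on the sentence from the question. The variable names are illustrative only, and a `+` quantifier is added so that runs of separators do not produce empty strings.

```ruby
# Sentence from the question; "\u00FC" is the umlaut ü.
text = "Some super text with a german umlaut Wirtschaftspr\u00FCfer"

ascii_split = text.split(/\W+/)            # \W only knows ASCII word characters
word_split  = text.split(/[^[:word:]]+/)   # Unicode-aware "word" class
latin_split = text.split(/[^\p{Latin}]+/)  # anything outside the Latin script

puts ascii_split.last(2).inspect  # the umlaut acted as a separator here
puts word_split.last              # the word is kept whole
puts latin_split.last             # same result for this input
```

For input that contains digits or underscores the two Unicode-aware variants will differ, because \p{Latin} covers letters only.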

You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, newlines).

Why does this code [...] not retain the previously-given umlaut
Because \W matches any character that is not an ASCII word character (i.e. not a-z, not A-Z, not 0-9 and not _), and ü is such a character.
Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?
Sure, you can for example split by whitespace, which is the default if no pattern is given:
"Müllmann Straßenverkehr Wirtschaftsprüfer".split
=> ["Müllmann", "Straßenverkehr", "Wirtschaftsprüfer"]

From Ruby doc:
/\W/ - A non-word character ([^a-zA-Z0-9_])
ü isn't a word character, so \W matches it and splits there. \p{Lu} and \p{Ll} are Ruby shorthands for Unicode uppercase and lowercase letters, so you could do:
text.split(/[^\p{Ll}\p{Lu}]/)
... and should split even the most exotic strings.
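A related approach, sketched here as an alternative rather than as what the answer above does, is to collect runs of letters with scan instead of splitting on non-letters. It assumes the general Unicode property \p{L} (any letter), and the sample string is made up for illustration.

```ruby
# Hypothetical sample with punctuation between the words.
# "\u00FC" is ü and "\u00DF" is ß.
text = "M\u00FCllmann, Stra\u00DFenverkehr; Wirtschaftspr\u00FCfer!"

# scan returns every run of one or more letters, so punctuation and
# whitespace never produce empty strings in the result.
words = text.scan(/\p{L}+/)

p words
```

Unlike split, scan does not return a leading empty string when the text starts with a separator.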

Because you used /\W/ to split the text, which matches anything not in this list:
a-zA-Z0-9_
Try splitting on
[^\wü]
which is
^ not in
\w a-zA-Z0-9_
ü
(alternatively look at creating your own pattern which you can reuse)
http://ruby-doc.org/core-1.9.3/Regexp.html ref

Related

Regex to find strings with only letters or numbers or both

I am searching for strings with only letters or numbers or both. How could I write a regex for that?
You can use following regex to check if the string contains letters and/or numbers
^[a-zA-Z0-9]+$
Explanation
^: Starts with
[]: Character class
a-zA-Z: Matches any ASCII letter
0-9: Matches any digit
+: Matches the preceding character class one or more times
$: Ends with
"abc&#*(2743438" !~ /[^a-z0-9]/i # => false
"abc2743438" !~ /[^a-z0-9]/i # => true
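This example avoids the line anchors ^ and $, which match at the start and end of each line rather than of the whole string and may therefore present a security risk. It's better to use \A and \z, or, if you really do want line anchors, to add the :multiline => true option in Rails.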
Only letters and numbers:
/\A[a-zA-Z0-9]+\z/
Or, if you also want to allow the - and _ characters:
/\A[a-zA-Z0-9_\-]+\z/
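The difference between the two kinds of anchors can be seen with a multi-line input. This is a small sketch; the constant names and the sample input are made up for illustration.

```ruby
# ^ and $ anchor to lines, \A and \z anchor to the whole string.
LINE_ANCHORED   = /^[a-zA-Z0-9]+$/
STRING_ANCHORED = /\A[a-zA-Z0-9]+\z/

# First line is alphanumeric, second line is not.
input = "abc123\n<script>"

p(input =~ LINE_ANCHORED)     # matches the first line on its own
p(input =~ STRING_ANCHORED)   # rejects the string as a whole
p("abc123" =~ STRING_ANCHORED)
```

With the line-anchored pattern a single valid line is enough for the match to succeed, which is why it is a poor fit for validating a whole value.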

Ruby Regular expression not matching properly

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regexes much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is,1234 and
If I put in something else, I get different output.
Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]
You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*
This will solve it:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character
Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters and another word boundary.
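As a quick check, the word-boundary pattern can be applied with scan to the sentence from the question. This is only a sketch; the /i flag is added here so that uppercase vowels would count as well.

```ruby
sentence = "test is 1234 and sky fly test1234"

# Each match is a whole word (bounded by \b) containing at least one vowel.
vowel_words = sentence.scan(/\b\w*[aeiou]\w*\b/i)

p vowel_words  # => ["test", "is", "and", "test1234"]
```

"1234", "sky" and "fly" are skipped because none of them contains a, e, i, o or u.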

What is the difference between \d and \p{Digit}?

While I have been using \p{Alpha} and \p{Space} for quite some time in my regular expressions I just came across \p{Digit}, but I couldn't find any information about what the up- or downsides are compared to the normal \d that I normally use. What are the key differences between those two?
\d matches only ASCII digits, i.e. it is equivalent to the class [0-9]. \p{Digit} matches the same characters as \d plus any other Unicode character that represents a digit. For example, the Arabic-Indic digit zero (code point U+0660) is matched only by \p{Digit}:
"\u0660"
# => "٠"
"\u0660" =~ /\d/
# => nil
"\u0660" =~ /\p{Digit}/
# => 0

Remove hex escape from string

I have the following hex as a string: "\xfe\xff". I'd like to convert this to "feff". How do I do this?
The closest I got was "\xfe\xff".inspect.gsub("\\x", ""), which returns "\"FEFF\"".
"\xfe\xff".unpack("H*").first
# => "feff"
You are dealing with what's called an escape sequence in your double quoted string. The most common escape sequence in a double quoted string is "\n", but ruby allows you to use other escape sequences in strings too. Your string, "\xfe\xff", contains two hex escape sequences, which are of the form:
\xNN
Escape sequences represent ONE character. When ruby processes the string, it notices the "\" and converts the whole hex escape sequence to one character. After ruby processes the string, there is no \x left anywhere in the string. Therefore, looking for a \x in the string is fruitless--it doesn't exist. The same is true for the characters 'f' and 'e' found in your escape sequences: they do not exist in the string after ruby processes the string.
Note that ruby processes hex escape sequences in double quoted strings only, so the type of string--double or single quoted--is entirely relevant. In a single quoted string, the series of characters '\xfe' is four characters long because there is no such thing as a hex escape sequence in a single quoted string:
str = "\xfe"
puts str.length #=>1
str = '\xfe'
puts str.length #=>4
Regexes behave like double quoted strings, so it is possible to use an entire escape sequence in a regex:
/\xfe/
When ruby processes the regex, then just like with a double quoted string, ruby converts the hex escape sequence to a single character. That allows you to search for the single character in a string containing the same hex escape sequence:
if "abc\xfe" =~ /\xfe/
If you pretend for a minute that the character ruby converts the escape sequence "\xfe" to is the character 'z', then that if statement is equivalent to:
if "abcz" =~ /z/
It's important to realize that the regex is not searching the string for a '\' followed by an 'x' followed by an 'f' followed by an 'e'. Those characters do not exist in the string.
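The inspect() method lets you see the escape sequences in a string: it returns a new string in which each escape sequence is spelled out as literal characters. The effect is much the same as escaping the backslashes yourself, like this: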
str = "\\xfe\\xff"
puts str
--output:--
\xfe\xff
In a double quoted string, "\\" represents a literal backslash, while an escape sequence begins with only one slash.
Once you've nullified the escape sequences, then you can match the literal characters, like the two character sequence '\x'. But it's easier to just pick out the parts you want rather than matching the parts you don't want:
str = "\xfe\xff"
str = str.inspect #=> "\"\\xFE\\xFF\""
result = ""
str.scan(/x(..)/) do |groups_arr|
result << groups_arr[0]
end
puts result.downcase
--output:--
feff
Here it is with gsub:
str = "\xfe\xff"
str = str.inspect #=>"\"\\xFE\\xFF\""
str.gsub!(/
"? #An optional quote mark
\\ #A literal '\'
x #An 'x'
(..) #Any two characters, captured in group 1
"? #An optional quote mark
/xm) do
Regexp.last_match(1)
end
puts str.downcase
--output:--
feff
Remember, a regex acts like a double quoted string, so to specify a literal \ in a regex, you have to write \\. However, in a regex you don't have to worry about a " being mistaken for the end of the regex, so you don't need to escape it, like you do in a double quoted string.
Just for fun:
str = "\xfe\xff"
result = ""
str.each_byte do |int_code|
result << sprintf('%02x', int_code)
end
p result
--output:--
"feff"
Why are you calling inspect? That's what adds the extra quotes.
Also, putting the text in double quotes means each \x sequence is interpreted as an escape sequence. Put it in single quotes and everything should be good.
'\xfe\xff'.gsub("\\x","")
=> "feff"
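One detail worth checking when formatting bytes one at a time is zero padding. The sketch below uses a made-up two-byte string whose first byte is below 0x10 and compares an unpadded format, a padded format and unpack; the variable names are illustrative only.

```ruby
# Hypothetical input: the first byte (0x0a) needs a leading zero in hex.
bytes = "\x0a\xff"

unpadded = bytes.each_byte.map { |b| format('%x', b) }.join    # drops the leading zero
padded   = bytes.each_byte.map { |b| format('%02x', b) }.join  # always two digits per byte
unpacked = bytes.unpack("H*").first                            # hex string, high nibble first

p unpadded  # => "aff"
p padded    # => "0aff"
p unpacked  # => "0aff"
```

For the original "\xfe\xff" all three give "feff", so the difference only shows up with small byte values.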

How to use double brackets in a regular expression?

What do double square brackets mean in a regex? I am confused about the following examples:
/[[^abc]]/
/[^abc]/
I was testing using Rubular, but I didn't see any difference between the one with double brackets and single brackets.
Posix character classes use a [:alpha:] notation, which are used inside a regular expression like:
/[[:alpha:][:digit:]]/
You'll need to scroll down a ways to get to the Posix information in the link above. From the docs:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set
# U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
/[[:digit:]]/.match("\u06F2") #=> #<MatchData "\u{06F2}">
/[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
/[[:xdigit:]][[:xdigit:]]/.match("A6") #=> #<MatchData "A6">
'[[' doesn't have any special meaning. [xyz] is a character class and will match a single x, y or z. The caret ^ negates the class, so it matches any character not in the brackets.
Removing ^ for simplicity, you can see that the first open bracket is being matched with the first close bracket and the second open bracket is being used as part of the character class. The final close bracket is treated as another character to be matched. (Ruby's engine may instead treat the inner brackets as a nested character class, which gives the same results in the examples below.)
irb(main):032:0> /[[abc]]/ =~ "[a]"
=> 1
irb(main):033:0> /[[abc]]/ =~ "a]"
=> 0
This appears to have the same result as your original in some cases
irb(main):034:0> /[abc]/ =~ "a]"
=> 0
irb(main):034:0> /[abc]/ =~ "a"
=> 0
But this is only because your regular expression is not looking for an exact match.
irb(main):036:0> /^[abc]$/ =~ "a]"
=> nil
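The unanchored examples cannot tell apart the two possible readings of the double brackets, so it may help to anchor the double-bracket pattern as well. The sketch below assumes nothing beyond core Ruby; in the Ruby versions I'm aware of, the inner brackets appear to be treated as a nested character class, but it is worth running against your own version rather than taking this as definitive.

```ruby
double = /\A[[abc]]\z/   # double brackets, anchored to the whole string
single = /\A[abc]\z/     # single brackets, anchored the same way

p(double =~ "a")    # does the double-bracket form accept a lone "a"?
p(double =~ "a]")   # or does it require a trailing "]"?
p(single =~ "a")

# The negated forms from the question can be compared the same way.
p(/[[^abc]]/ =~ "abcd")
p(/[^abc]/ =~ "abcd")
```

If the first call returns 0 and the second returns nil, the double-bracket form is behaving like the single-bracket one, which would match what was observed on Rubular.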
