Ruby 1.9: Regular Expressions with unknown input encoding

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:
x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/
x.match(re)
=> #<MatchData "<p>bar</p>" 1:"bar">
y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:
if y.methods.include?(:encode) # Ruby 1.8 compatibility
  if y.encoding.name != 'UTF-8'
    y = y.encode('UTF-8')
  end
end
y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">
However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?
Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.
# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end
# Inside code looping through lines of input.
# The variables 'regex' and 'last_encoding' should be initialized previously,
# so they persist across loops.
if line.methods.include?(:encoding) # Ruby 1.8 compatibility
  if line.encoding != last_encoding
    regex = get_regex('<p>(.*)<\/p>', line.encoding, 16) # 16 = Regexp::FIXEDENCODING (binary 00010000)
    last_encoding = line.encoding
  end
end
line.match(regex)
In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
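If even the pathological case matters, one further sketch (my own elaboration; regex_cache and lines are illustrative names, not from the answer above) is to memoize one compiled regexp per encoding seen:
regex_cache = Hash.new do |cache, enc|
  cache[enc] = Regexp.new('<p>(.*)<\/p>'.encode(enc), Regexp::FIXEDENCODING)
end
lines.each do |line|
  line.match(regex_cache[line.encoding])
end
This re-encodes the pattern at most once per distinct encoding, no matter how the input alternates.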

Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add
# encoding: utf-8
to the top of your .rb file.

Related

In ruby, how do I turn a text representation of a byte into a byte?

What is the best way to turn the string "FA" into /\xFA/ ?
To be clear, I don't want to turn "FA" into 7065 or "FA".to_i(16).
In Java the equivalent would be this:
byte b = (byte) Integer.decode("0xFA");
So you're using / markers, but you aren't actually asking about regexps, right?
I think this does what you want:
['FA'].pack('H*')
# => "\xFA"
There is no actual byte type in the Ruby stdlib (not that I know of, anyway), just Strings, which can be any number of bytes long (in this case, one). A single "byte" is typically represented as a 1-byte-long String in Ruby. #bytesize on a String will always return the length in bytes.
"\xFA".bytesize
# => 1
Your example happens not to be a valid UTF-8 character by itself. Depending on exactly what you're doing and how your environment is set up, your string might end up being tagged with a UTF-8 encoding by default. If you are dealing with binary data and want to make sure the string is tagged as such, you might want to #force_encoding on it to be sure. It should NOT be necessary when using #pack; the results should be tagged as ASCII-8BIT already (which has a synonym of BINARY; it's basically the "null encoding" used in Ruby for binary data).
['FA'].pack('H*').encoding
# => #<Encoding:ASCII-8BIT>
But if you're dealing with string objects holding what's meant to be binary data, not necessarily valid character data in any encoding, it is useful to know you may sometimes need to do str.force_encoding("ASCII-8BIT") (or force_encoding("BINARY"), same thing) to make sure your string isn't tagged as a particular text encoding. Otherwise, if it includes invalid bytes for that encoding, Ruby will complain when you try certain operations on it, or in other cases possibly do the wrong thing.
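For instance, a minimal illustration of that tagging (assuming a UTF-8 source file; results may differ in another environment):
s = "\xFA"                  # the literal gets tagged with the source encoding, UTF-8 here
s.valid_encoding?           # => false; 0xFA by itself is not valid UTF-8
s.force_encoding("ASCII-8BIT")
s.valid_encoding?           # => true; any byte sequence is valid binary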
Actually for a regexp
Okay, you actually do want a regexp. So we have to take our string we created, and embed it in a regexp. Here's one way:
representation = "FA"
str = [representation].pack("H*")
# => "\xFA"
data = "\x01\xFA\xC2".force_encoding("BINARY")
regexp = Regexp.new(str)
data =~ regexp
# => 1 (matched on byte 1; the first byte of data is byte 0)
You see how I needed the force_encoding there on the data string; otherwise Ruby would default to it being a UTF-8 string (depending on Ruby version and environment setup) and complain that those bytes aren't valid UTF-8.
In some cases you might need to explicitly set the regexp to handle binary data too; the docs say you can pass an extra argument 'n' to Regexp.new to do that, but I've never done it.
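Untested, but per those docs it would presumably look something like this in 1.9-era Rubies (the 'n' goes in the third position; much later Ruby versions deprecated that argument):
r = Regexp.new("\xFA".force_encoding("BINARY"), nil, 'n')
r.encoding # => #<Encoding:ASCII-8BIT>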

Does anyone know what's going on with "weird" characters at the beginning of a string?

I've just run into a strange issue when trying to detect a certain string among an array of them. Does anyone know what's going on here?
(rdb:1) p magic_string
"Time Period"
(rdb:1) p magic_string.class
String
(rdb:1) p magic_string == "Time Period"
false
(rdb:1) p "Time Period".length
11
(rdb:1) p magic_string.length
14
(rdb:1) p magic_string[0].chr
"\357"
(rdb:1) p magic_string[1].chr
"\273"
(rdb:1) p magic_string[2].chr
"\277"
(rdb:1) p magic_string[3].chr
"T"
Your string contains 3 extra bytes at the beginning (a BOM) indicating that the encoding is UTF-8.
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at
the beginning of a data stream, where it can be used as a signature
defining the byte order and encoding form, primarily of unmarked
plaintext files. Under some higher level protocols, use of a BOM may
be mandatory (or prohibited) in the Unicode data stream defined in
that protocol.
(Source: the Unicode Consortium's BOM FAQ, http://unicode.org/faq/utf_bom.html)
This might help you to understand what's happening:
# encoding: UTF-8
RUBY_VERSION # => "1.9.3"
magic_string = "\uFEFFTime Period" # BOM written explicitly; it is invisible when displayed
magic_string[0].chr # => "\uFEFF"
The same output is true with Ruby v2.2.2.
Older versions of Ruby didn't default to UTF-8 and treated strings as arrays of bytes. The encoding line is important; it tells Ruby what encoding the script's strings are in.
Ruby now correctly treats Strings as arrays of characters rather than bytes, which is why it reports the first character as "\uFEFF", a single character that occupies three bytes in UTF-8 (the \357\273\277, i.e. EF BB BF, seen above).
"\uFEFF" and "\uFFFE" are BOM markers showing which "endian" the characters are: reading U+FEFF means the byte order matches yours, while U+FFFE means the bytes were swapped. Endianness is tied to the CPU's idea of which byte in a word (typically two bytes) is most significant and which is least significant. This is also tied to Unicode; both are things you need to understand at least in a rudimentary way, since we don't deal with only ASCII any more, and languages don't consist of only the Latin character set.
UTF-8 is a multibyte character set that incorporates a huge number of characters from multiple languages. You can also run into UTF-16LE, UTF-16BE, or wider encodings; HTML and documents on the internet can be encoded with varying character widths depending on the originating hardware, and not being aware of that can drive you nuts and send you down the wrong paths trying to read their content. It's important to read the IO class documentation on "IO Encoding" to know the right way to deal with these types of files.
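As a practical illustration (the filename here is made up): Ruby 1.9.2+ can consume a leading BOM for you when reading a file, or you can strip it from a string you already have:
# "bom|utf-8" tells IO to skip a leading BOM if one is present
text = File.read("data.txt", mode: "r:bom|utf-8")
# Or remove it manually from an existing string:
magic_string = magic_string.sub(/\A\uFEFF/, "")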

Split utf8 string regardless of ruby version

str = "é-du-Marché"
I get the first char via
str.split(//).first
How can I get the rest of the string, regardless of my Ruby version?
String does not have a first method, so you need a split in addition. When you do the split in Unicode mode (specifically UTF-8), you have access to the first (and the other) characters.
My solution:
puts RUBY_VERSION
str = "é-du-Marché"
p str.split(//u, 2)
Test with ruby 1.9.2:
1.9.2
["\u00E9", "-du-March\u00E9"]
Test with ruby 1.8.6:
1.8.6
["\303\251", "-du-March\303\251"]
With first and last you get your results:
str.split(//u, 2).first is the first character
str.split(//u, 2).last is the string after the first character.
str[1..-1] should return you everything after the first character normally.
The first number is the starting index, which is set to 1 to skip the first character; the second is the end index, and -1 means Ruby counts from the back, i.e. up to the last character.
Note that character-based (rather than byte-based) indexing only works in Ruby 1.9. If you wish to mimic this behavior on older versions, you'll have to loop over the bytes yourself and figure out what needs to be removed from the data, because Ruby 1.8 does not support it.
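To make the version difference concrete (my own illustration of the behavior described above):
str = "é-du-Marché"
str[1..-1]
# Ruby 1.9+: "-du-Marché"            (character-based indexing)
# Ruby 1.8:  "\251-du-March\303\251" (byte-based: the two-byte é gets cut in half)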
UPDATE:
You could try this as well, but I can't guarantee that it will work for every multibyte char:
str = "é-du-Marché"
substring = str.mb_chars[1..-1]
mb_chars returns a proxy class that directs calls to the appropriate implementation when dealing with UTF-8, UTF-16, or UTF-32 encoded characters (i.e. multibyte chars).
More detailed info can be found here : http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
But I do not know whether this exists in older Rails versions.
UPDATE2:
Ruby 1.8 treats any string as just a bunch of bytes; calling size on it will return the number of bytes used to store the data. To determine the characters regardless of the encoding, try this:
char_array = str.scan(/./m)
substring = char_array[1..-1].join
This should do the trick normally. Try looking at http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 which explains how to treat multibyte data in older Ruby versions.
UPDATE3:
Playing around with the scan & join operations brings me closer to your problem & solution. I honestly don't have the time to get the full solution working, but if you use the scan(/./mu) form, the pattern is treated as UTF-8, which is supported by all Ruby versions.
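Putting that hint together, a sketch that should behave the same on 1.8 and 1.9 for UTF-8 input:
chars = str.scan(/./mu)   # /u forces UTF-8 character matching, even on 1.8
first = chars.first       # => "é"
rest  = chars[1..-1].join # => "-du-Marché"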

How to read only English characters

I am reading a file that sometimes has Chinese and characters of languages other than English.
How can I write a regex that only reads English words/letters?
Should it just be /^[a-zA-Z]+/?
If I do the above, then words like "eété" will still be picked up, but I don't want that:
"été".match(/^[a-zA-Z]+/)  # => nil; good, I didn't want that word
"eété".match(/^[a-zA-Z]+/) # => #<MatchData "e">; tricked into picking something I did not want
The only truly English letter that comes to mind is wynn ƿ.
One could make an argument for eth ð and thorn þ, but it would be a much weaker argument than can be made for wynn.
Other than those, English typically uses the Latin alphabet, albeit with certain modifications. Wynn possibly excepted, there is no such thing as an English letter, only a Latin letter.
There certainly exist regular expressions that demand that base characters be in the Latin or Common scripts, such as for example
(?:[\p{Script=Latin}\p{Script=Common}]\pM*+)+
but as you haven’t specified whether you’re using a 7- or 8-bit version of Ruby or a 21-bit version, I’m not sure what to tell you.
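For what it's worth, a rough Ruby adaptation of that pattern (my own translation: Ruby 1.9+ spells the properties \p{Latin} and \p{M}, and I've dropped the Common-script alternative for brevity). Note it still accepts é, since é is a Latin-script letter:
latin_word = /\A(?:\p{Latin}\p{M}*+)+\z/
"eété".match(latin_word) # matches; é is Latin script, so this won't filter it out
"英語".match(latin_word) # => nil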
You need $, which means end of line:
/^[a-zA-Z]+$/
Or if you use such filtration:
strings.select { |s| /^[a-zA-Z]+$/ =~ s }
# which is equal to strings.grep /^[a-zA-Z]+$/
you can use the negative filtration method with a slightly simpler regex:
strings.reject { |s| /[^a-zA-Z]/ =~ s }
where [^a-zA-Z] matches any character that is not an English letter.
Sometimes it's useful to use the Iconv library to deal with non-ASCII:
require 'iconv'
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # !> encoding option isn't portable: TRANSLIT//IGNORE
utf8_to_ascii_translit = Iconv.new("ASCII//TRANSLIT", "UTF8") # !> encoding option isn't portable: TRANSLIT
utf8_to_ascii_ignore = Iconv.new("ASCII//IGNORE", "UTF8") # !> encoding option isn't portable: IGNORE
resume = "Résumé"
utf8_to_latin1.iconv(resume) # => "R\xE9sum\xE9"
utf8_to_ascii_translit.iconv(resume) # => "R'esum'e"
utf8_to_ascii_ignore.iconv(resume) # => "Rsum"
Notice that Ruby warns that the option choices are not portable. That means there might be some damage to the string being processed; the "//TRANSLIT" and "//IGNORE" options can degrade the string, but for our purpose it's OK.
James Gray wrote a nice article about Encoding Conversion With iconv, which is useful for understanding what Iconv can do, along with dealing with UTF-8 and Unicode characters.
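For the record, Iconv was deprecated in Ruby 1.9 and removed from the stdlib in 2.0; a rough String#encode equivalent of the //IGNORE case looks like this (there is no built-in counterpart to //TRANSLIT):
resume = "Résumé"
resume.encode("ASCII", invalid: :replace, undef: :replace, replace: "")
# => "Rsum"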

Ruby 1.9 regex encoding

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with nokogiri and then running some regex on the contents of some tags. The content is mostly UTF-8, but is occasionally corrupt. However, for my case I don't really care and just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that regexes in my script are treated as either UTF-8 or US-ASCII, no matter what I set the encoding comment to or how I create the regex.
Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing &amp; with &)
You need to encode the initial string and use the FIXEDENCODING option.
1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
Strings have an encoding property. Try using String#force_encoding on the string before applying the regex.
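A small sketch of that suggestion, which also covers the gsub aside in the question (gsub accepts a plain String pattern, sidestepping regexp encoding rules entirely):
content = content.force_encoding("ASCII-8BIT")
content.gsub("&amp;", "&") # String pattern: no regexp involved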
UPD: To make your regexp ASCII, look at the accepted answer here: Ruby 1.9: Regular Expressions with unknown input encoding
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end
