How to unescape/decode escaped HTML characters? - ruby

When I use Nokogiri to parse HTML, the Chinese characters are turned into escaped sequences like
"巅峰延时"
How can I decode escaped characters like "巅峰延时" back to normal characters?

It looks like your HTML page is encoded as UTF-8 but you are parsing it as ISO-8859-1. You need to specify the correct encoding when parsing. If you are parsing from a string, Nokogiri should use the same encoding as the string. If you are parsing from an IO object, you can specify the encoding as the third argument to the parse method:
Nokogiri::HTML::Document.parse(io_object, nil, 'UTF-8')
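To see why a wrong encoding guess produces this kind of garbage, here is a minimal sketch in plain Ruby (no Nokogiri involved) that reads UTF-8 bytes as ISO-8859-1 and then undoes the damage. The sample string and the variable names are assumptions chosen for illustration, not taken from the question.

```ruby
# UTF-8 text whose bytes get mislabeled as ISO-8859-1 (illustrative sample).
original = "é" # two bytes in UTF-8: C3 A9

# Mislabel the bytes, then transcode to UTF-8: each byte becomes its own character.
mojibake = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts mojibake

# Reverse the mistake: go back to the raw bytes, then label them correctly.
repaired = mojibake.encode('ISO-8859-1').force_encoding('UTF-8')
puts repaired
```

The round trip only works because ISO-8859-1 assigns a character to every byte value; telling the parser the right encoding up front is still the better fix.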

What should the normal characters be, though? This looks like their string representations.
Otherwise, you have CGI.unescapeHTML() and CGI.escapeHTML() available in the Ruby standard library.
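As a rough illustration of those two methods, the sketch below decodes a string containing numeric character references and named entities. The input string is an assumption made up for the example.

```ruby
require 'cgi'

# Numeric character references (hex or decimal) and named entities are decoded.
escaped = "&#x4e2d;&#25991; &lt;b&gt; &amp;"
decoded = CGI.unescapeHTML(escaped)
puts decoded

# escapeHTML goes the other way for the markup-significant characters.
puts CGI.escapeHTML("<b>")
```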

Related

Ruby: How to decode strings which are partially encoded or fully encoded?

I am getting encoded strings while parsing text files. I have no idea how to decode them to English or their original language.
"info#cloudag.com"
is the encoded string, and I need to decode it.
I want to decode it using Ruby.
Here is a link for your reference and I am expecting the same.
This looks like HTML encoding, not URL encoding.
require 'cgi'
CGI.unescapeHTML("info#cloudag.com")
#=> "info#cloudag.com"
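The difference between the two kinds of encoding can be sketched as follows. The address and the two escaped forms are hypothetical samples, not the string from the question.

```ruby
require 'cgi'

# Hypothetical sample address, escaped two different ways.
html_escaped = "user&#64;example.com" # HTML numeric character reference
url_escaped  = "user%40example.com"   # URL percent-encoding

puts CGI.unescapeHTML(html_escaped) # decodes &...; entities
puts CGI.unescape(url_escaped)      # decodes %XX sequences

# Using the wrong decoder leaves the input unchanged.
puts CGI.unescapeHTML(url_escaped)
```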

How to address Compatibility Error with ruby

I have a Ruby program that parses a large block of text with a number of regular expressions. The problem I'm having is that any time the text contains 'special characters' (for example Kuutõbine or Noël), the program throws an Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string). How do I force the proper encoding?
Your Regex is being "compiled" as ASCII-8BIT.
Just add the encoding declaration at the top of the file where the Regex is declared:
# encoding: utf-8
And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regex, String, etc) is specified in UTF-8 encoding.
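The error message in the question reports a UTF-8 regexp matched against an ASCII-8BIT string, so it may also be necessary to re-tag the input text itself. A minimal sketch, assuming the bytes really are UTF-8 and were merely read in as binary (the sample text and variable names are made up for illustration):

```ruby
# Simulate text that arrived tagged as binary, e.g. from a file opened in binary mode.
text = "Noël".b # same bytes, encoding ASCII-8BIT
pattern = /ë/   # a UTF-8 regexp, because the literal contains a non-ASCII character

begin
  text =~ pattern
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.message}"
end

# Re-tag the bytes as UTF-8 (no conversion happens), then match again.
utf8_text = text.dup.force_encoding('UTF-8')
puts utf8_text.valid_encoding?
puts utf8_text =~ pattern
```

If the input is not actually UTF-8, valid_encoding? returns false and the text needs a real conversion with encode instead.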

Rails 3.2.21 / ruby 1.9.3 how can I \u encode unicode chars within a string

I need to sanitize some text sent to an email service provider (Sendgrid) that does not support unicode in the recipient name unless it is \u escaped.
Given the UTF-8 string s = "Pablö", how can I "\u escape" any Unicode inside the string so that I get "Pabl\u00f6"?
Converting to JSON also escapes the quotes (which I don't want):
"Pablö".to_json
=> "\"Pabl\\u00f6\""
What I'm looking for is something just like .force_encoding('binary') except for Unicode. Inspecting Encoding.aliases.values.uniq I don't see anything like 'unicode'.
I'm going to assume that everything is UTF-8 because we're not cavemen banging rocks together.
to_json isn't escaping quotes, it is adding quotes inside the string (because JSON requires strings to be quoted) and then inspect escapes them (and the backslash).
These quotes from to_json should always be there so you could just strip them off:
"Pablö".to_json[1..-2] # Lots of ways to do this...
=> "Pabl\\u00f6"
Keep in mind, however, that the behavior of to_json and UTF-8 depends on which JSON library you're using and possibly other things. For example, in my stock Ruby 2.2, the standard JSON library leaves UTF-8 alone; the JSON specification is quite happy with UTF-8 so why bother encoding it? So you might want to do it yourself with something like:
s.chars.map { |c| c.ord > 127 ? '\u%.4x' % c.ord : c }.join
Anything above 127 is out of ASCII range so that simple ord test takes care of anything like ö, ñ, µ, ... You'll want to adjust the map block if you need to encode other characters (such as \n).
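One thing the one-liner does not cover is characters above U+FFFF, which need two \u escapes (a surrogate pair) rather than five hex digits. A possible variant, with escape_non_ascii as a hypothetical helper name, goes through UTF-16 to get the code units:

```ruby
# Hypothetical helper: escape every non-ASCII character as \uXXXX,
# going through UTF-16 so characters beyond U+FFFF become surrogate pairs.
def escape_non_ascii(str)
  str.each_char.map { |c|
    next c if c.ord < 128 # plain ASCII passes through untouched
    c.encode('UTF-16BE').unpack('n*').map { |unit| '\u%04x' % unit }.join
  }.join
end

puts escape_non_ascii("Pablö")
```

This assumes the input is valid UTF-8; whether the receiving service accepts surrogate pairs is something to check against its documentation.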

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I want this text in UTF-8. However,
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
Why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
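Note that URI.unescape was removed in Ruby 3.0. On current Rubies, a sketch of the same decoding could use URI.decode_www_form_component, which returns a UTF-8 string by default:

```ruby
require 'uri'

decoded = URI.decode_www_form_component('%C5%BBaba')
puts decoded
puts decoded.encoding
```

One caveat: this method follows form-encoding rules, so a literal + in the input is decoded as a space.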
ASCII-8BIT is a pseudo-encoding native to Ruby. It is aliased to BINARY, and that is exactly what it is: not a character encoding, but a way of saying that a string is binary data and should not be processed as text. Because the pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we could give you a better solution.
If you isolate the hex code of a character into a variable, say code, which is a string of hexadecimal digits without the percent sign:
utf_char=[code.to_i(16)].pack("U")
Note that pack("U") treats the number as a Unicode code point, not as UTF-8 bytes. Combine the result with the rest of the string to build your string.
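For the hex string in the question, the bytes C3 A9 are already the UTF-8 encoding of one character, so relabeling them is enough. The encode call fails because it tries to transcode from ASCII-8BIT, where bytes above 0x7F have no defined mapping. A minimal sketch (variable names are illustrative):

```ruby
# 'H*' packs hex digits into raw bytes; the result is tagged ASCII-8BIT.
bytes = ["C3A9"].pack('H*')

# The bytes are already valid UTF-8, so relabel them instead of transcoding.
# force_encoding changes the tag in place and returns the same string.
text = bytes.force_encoding('UTF-8')
puts text
puts text.valid_encoding?

# pack('U') takes a Unicode code point (here U+00E9), not UTF-8 bytes.
puts ["00E9".to_i(16)].pack('U')
```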

Getting a meaning of the string

I have the following string "\u3048\u3075\u3057\u3093". I got the string
from a web page as part of returned data in JSONP.
What is that? It looks like UTF-8, but then shouldn't it look like "U+3048U+3075U+3057U+3093"?
What's the meaning of the backslashes (\)?
How can I convert it to a human-readable form?
I'm looking for a solution in Ruby, but any explanation of what's going on here is appreciated.
The U+3048 syntax is normally used to represent the Unicode code point of a character. Such code point is fixed and does not depend on the encoding (UTF-8, UTF-32...).
A JSON string is composed of Unicode characters except double quote, backslash and those in the U+0000 to U+001F range (control characters). Characters can be represented with an escape sequence starting with \u and followed by 4 hexadecimal digits that represent the Unicode code point of the character. This is the JavaScript syntax (JSON is a subset of it). In JavaScript, the backslash is used as the escape character.
It is Unicode, expressed as UTF-16 code units rather than UTF-8 bytes. For characters in the Basic Multilingual Plane, which covers all four characters here, the four hex digits are simply the Unicode code point; only characters above U+FFFF need a surrogate pair of two \u escapes.
Using Ruby 1.9:
require 'json'
puts JSON.parse("[\"\\u4e00\",\"\\u4e8c\"]")
Prints:
一
二
Unicode characters in JSON are escaped as backslash u followed by four hex digits. See the string production on json.org.
Any JSON parser will convert it to the correct representation for your platform (if it doesn't, then by definition it is not a JSON parser)
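Applied to the string from the question, a minimal sketch might look like this. It assumes the raw text contains no unescaped double quotes, since it is wrapped in quotes by hand to form a valid JSON document.

```ruby
require 'json'

# The raw text as it appears in the JSONP payload: literal backslashes, not yet decoded.
raw = '\u3048\u3075\u3057\u3093'

# Wrap it in quotes and an array so it forms a valid JSON document, then parse.
decoded = JSON.parse(%(["#{raw}"])).first
puts decoded
```

In practice the whole JSONP payload would be parsed at once (after stripping the callback wrapper), and the strings come out already decoded.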
