I need to clean up various Word 'smart' characters in user input, including but not limited to the following:
– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK
Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?
The HTMLEntities gem will decode the entities to UTF-8.
You could use iconv to transliterate to the closest ASCII equivalents, or simple gsub or tr calls. James Edward Gray II has a series of blog posts about converting between character sets that shows how to do the transliterations.
require 'htmlentities'
chars = [
'–', # EN DASH
'‘', # LEFT SINGLE QUOTATION MARK
'’' # RIGHT SINGLE QUOTATION MARK
]
decoder = HTMLEntities.new('expanded')
chars.each do |c|
puts "#{ c } => #{ decoder.decode(c) } => #{ decoder.decode(c).tr('–‘’', "-'")} => #{ decoder.decode(c).encoding }"
end
# >> – => – => - => UTF-8
# >> ‘ => ‘ => ' => UTF-8
# >> ’ => ’ => ' => UTF-8
Some gsubs sound like the best bet, especially if you're planning to load an entire extra library to do basically the same thing.
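The "bunch of gsubs" can be collapsed into a single call by passing a hash as the replacement. The sketch below assumes a small, hand-picked mapping; the constant names and the helper asciify_smart_chars are made up for illustration, and the table would need extending for any other characters that show up in your input.

```ruby
# Hypothetical mapping table: extend it as needed for your input.
SMART_CHARS = {
  "\u2013" => '-',   # EN DASH
  "\u2018" => "'",   # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'",   # RIGHT SINGLE QUOTATION MARK
  "\u201C" => '"',   # LEFT DOUBLE QUOTATION MARK
  "\u201D" => '"',   # RIGHT DOUBLE QUOTATION MARK
  "\u2026" => '...'  # HORIZONTAL ELLIPSIS
}.freeze

# One regex that matches any key of the table.
SMART_CHARS_RE = Regexp.union(SMART_CHARS.keys)

# Replace every mapped character with its ASCII stand-in in one pass.
def asciify_smart_chars(text)
  text.gsub(SMART_CHARS_RE, SMART_CHARS)
end

puts asciify_smart_chars("\u2018quoted\u2019 \u2013 \u201Cdone\u201D\u2026")
```

Characters that are not in the table pass through untouched, so this only cleans up what you have explicitly listed.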
I need to split a string using special characters like " < > = and white space.
Example of the string:
<fileset dir="/tmp/test/my_test" includes="all"/>
So far I have tried different combinations, with no results.
Example:
line.split(/<=>"\s+/).each do |line_parsed|
puts line_parsed
end
Regex is not the right tool for parsing XML. You can use any XML parser you like; here I am using Nokogiri:
require 'nokogiri'
doc = Nokogiri::XML(line)
fileset = doc.css('fileset').first
fileset.attr 'dir'
#=> "/tmp/test/my_test"
fileset.attr 'includes'
#=> "all"
If you have another loop à la each_line around the code you showed us, chances are you can drop this altogether and parse the whole document in one run.
Try enclosing the special characters in a "character class" ([...]) and moving the repetition character (+) outside:
parts = line.split(/[<=>"\s]+/)
# => ["", "fileset", "dir", "/tmp/test/my_test", "includes", "all", "/"]
parts[1] # => "fileset"
parts[2] # => "dir"
parts[3] # => "/tmp/test/my_test"
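If you stay with a regex and only care about the attributes, String#scan with two capture groups can pair each name with its value. This is a rough sketch that assumes double-quoted attribute values with no embedded quotes; for anything beyond a one-line tag like this, a real XML parser is the safer route.

```ruby
line = '<fileset dir="/tmp/test/my_test" includes="all"/>'

# Each match yields [name, value]; to_h turns the pairs into a hash.
attrs = line.scan(/(\w+)="([^"]*)"/).to_h

puts attrs['dir']
puts attrs['includes']
```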
I came upon a strange character (using Nokogiri).
irb(main):081:0> sss.dump
=> "\"\\u{a0}\""
irb(main):082:0> puts sss
=> nil
irb(main):083:0> sss
=> " "
irb(main):084:0> sss =~ /\s/
=> nil
irb(main):085:0> sss =~ /[[:print:]]/
=> 0
irb(main):087:0> sss == ' '
=> false
irb(main):088:0> sss.length
=> 1
Any idea what this strange character is?
When it's displayed in a webpage, it's a white space, but it doesn't match \s in a regular expression. Ruby even thinks it's a printable character!
How do I detect characters like this and exclude them or flag them as whitespace (if possible)?
Thanks
It's the non-breaking space. In HTML, it's used pretty frequently and often written as &nbsp;. One way to find out the identity of a character like "\u{a0}" is to search the web for U+00A0 (using four or more hexadecimal digits) because that's how the Unicode specification notates Unicode code points.
The non-breaking space and other things like it are included in the regex /[[:space:]]/.
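To make the difference concrete, here is a small sketch comparing \s with [[:space:]] on a non-breaking space, plus a helper that folds any run of such whitespace into a single ordinary space. The helper name normalize_spaces is made up for this example.

```ruby
nbsp = "\u00A0"

p nbsp =~ /\s/           # \s does not match the non-breaking space
p nbsp =~ /[[:space:]]/  # the POSIX bracket expression does

# Collapse every run of whitespace (including U+00A0) to one plain space.
def normalize_spaces(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

p normalize_spaces("a\u00A0\u00A0b \u00A0")
```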
So, I have
puts "test\\nstring".gsub(/\\n/, "\n")
and that works.
But how do I write one statement that replaces \n, \r, and \t with their correctly escaped counterparts?
You have to use backreferences. Try
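puts "test\\nstring".gsub(/(\\[nrt])/, '\1')
gsub sets $n (where n is the number of the corresponding group in the regular expression) to the content matched by that group. Inside a replacement string, write the backreference as '\1' rather than $1, because $1 passed as an argument is evaluated before gsub runs.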
EDIT:
I modified the regexp, now the output should be:
test\nstring
The \n won't be interpreted as a newline by puts.
Those aren't escaped characters, those are literal characters that are only represented as being escaped so they're human readable. What you need to do is this:
escapes = {
'n' => "\n",
'r' => "\r",
't' => "\t"
}
"test\\nstring".gsub(/\\([nrt])/) { escapes[$1] }
# => "test\nstring"
You will have to add other escape characters as required, and this still won't accommodate some of the more obscure ones if you really need to interpret them all. A potentially dangerous but really simple solution is to just eval it:
eval("\"test\\nstring\"")
This would be fine so long as you can be sure that your input doesn't contain things like #{ ... } that would allow injecting arbitrary Ruby, for example if this is a one-shot repair to fix some damaged encoding.
Update
There might be a mis-understanding as to what these backslashes are. Here's an example:
"\n".bytes.to_a
# => [10]
"\\n".bytes.to_a
# => [92, 110]
You can see these are two entirely different things. \n is a representation of ASCII character 10, a linefeed.
Through the help of @tadman and @black, I've discovered the solution:
>> escapes = {'\\n' => "\n", '\\t' => "\t"}
=> {"\\t"=>"\t", "\\n"=>"\n"}
>> "test\\nstri\\tng".gsub(/\\([nrt])/) { |s| escapes[s] }
=> "test\nstri\tng"
>> puts "test\\nstri\\tng".gsub(/\\([nrt])/) { |s| escapes[s] }
test
stri ng
=> nil
As it turns out, you just map the \\ to \ and all is good. Also, you need to use puts for the terminal to output the whitespace correctly.
escapes = {'\\n' => "\n", '\\t' => "\t"}
puts "test\\nstri\\tng".gsub(/\\([nrt])/) { |s| escapes[s] }
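A variation on the same idea, sketched here under the assumption that only \n, \r, and \t need handling: pass the lookup table straight to gsub as a hash, keyed by the full two-character match. The names ESCAPES and unescape are illustrative, not part of any library. Including the \r entry matters, because a match with no key in the hash is replaced by an empty string.

```ruby
# Single-quoted keys are two characters each: a backslash followed by a letter.
ESCAPES = { '\n' => "\n", '\r' => "\r", '\t' => "\t" }.freeze

# Swap each literal backslash sequence for the control character it names.
def unescape(text)
  text.gsub(/\\[nrt]/, ESCAPES)
end

p '\n'.bytes            # backslash and "n": two bytes
p unescape('\n').bytes  # a single linefeed byte
puts unescape('test\nstri\tng')
```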
Is there any way to prevent Ruby's JSON.pretty_generate() method from escaping a Unicode character?
I have a JSON object as follows:
my_hash = {"my_str" : "\u0423"};
Running JSON.pretty_generate(my_hash) returns the value as being \\u0423.
Is there any way to prevent this behaviour?
In your question you have a string of 6 unicode characters "\", "u", "0", "4", "2", "3" (my_hash = { "my_str" => '\u0423' }), not a string consisting of 1 "У" character ("\u0423", note double quotes).
According to RFC 4627, paragraph 2.5, a backslash character in a JSON string must be escaped, which is why you get a double backslash from JSON.pretty_generate.
Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as "\\".
char = unescaped /
       escape (...
           %x5C /          ; \    reverse solidus U+005C
escape = %x5C              ; \
Thus the Ruby JSON gem escapes this character internally, and there is no way to alter this behavior by passing options to JSON or JSON.pretty_generate.
If you are interested in the JSON gem's implementation details, it defines an internal mapping hash with an explicit mapping for the '\' character:
module JSON
MAP = {
...
'\\' => '\\\\'
I took this code from the pure Ruby variant of the JSON gem, gem install json_pure (note that there is also a C extension variant that is distributed by gem install json).
Conclusion: If you need to unescape the backslash after JSON generation, you need to implement it in your application logic, like in the code below:
my_hash = { "my_str" => '\u0423' }
# => {"my_str"=>"\\u0423"}
json = JSON.pretty_generate(my_hash)
# => "{\n \"my_str\": \"\\\\u0423\"\n}"
res = json.gsub "\\\\", "\\"
# => "{\n \"my_str\": \"\\u0423\"\n}"
Hope this helps!
Usually, hashes are declared using the rocket => rather than a colon :. Also, there has been an alternative syntax for symbol-keyed hashes since 1.9: my_hash = {my_str: "\u0423"}. In this case, :my_str would be the key.
Anyway, on my computer JSON.pretty_generate works as expected:
irb(main):002:0> my_hash = {"my_str" => "\u0423"}
=> {"my_str"=>"У"}
irb(main):003:0> puts JSON.pretty_generate(my_hash)
{
"my_str": "У"
}
=> nil
Ruby 1.9.2p290, (built-in) json 1.4.2.
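The two answers above differ only in how the string literal is quoted, which a short side-by-side run makes visible. This sketch uses the standard json library; the variable names real_char and six_chars are made up for the comparison.

```ruby
require 'json'

real_char = { 'my_str' => "\u0423" }  # double quotes: one character
six_chars = { 'my_str' => '\u0423' }  # single quotes: backslash, u, 0, 4, 2, 3

puts real_char['my_str'].length
puts six_chars['my_str'].length

# The single character is emitted as-is; the literal backslash gets escaped.
puts JSON.generate(real_char)
puts JSON.generate(six_chars)
```

If the generated JSON shows a doubled backslash, the likely cause is that the input string already contained a literal backslash, not that the generator mangled a Unicode character.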
I need to replace certain ASCII characters like # and & with their hex representations for a URL, which would be 23 and 26 respectively.
How can I do this in Ruby? There are also some characters, most notably '-', which do not need to be replaced.
require 'uri'
URI.escape str, /[#&]/
Obviously, you can widen the regex with more characters you want to escape. Or, if you want to do a whitelisting approach, you can do, say,
URI.escape str, /[^-\w]/
This is ruby, so there's a mandatory 20 different ways to do it. Here's mine:
>> a = 'one&two%three'
=> "one&two%three"
>> a.gsub(/[&%]/, '&' => '&'.ord, '%' => '%'.ord)
=> "one38two37three"
I'm pretty sure Ruby has this functionality built in for URLs. However, if you want to define some more general translation facility you may use code like the following:
s = "h#llo world"
t = { " " => "%20", "#" => "%23" }
puts s.split(//).map { |c| t[c] || c }.join
Which would output
h%23llo%20world
In the above code, t is a hash defining the mapping from specific characters to their representation. The string is broken into characters and the hash is searched for each character's equivalent.
More generically and easily:
require 'uri'
URI.escape(your_string, Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
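URI.escape was deprecated for a long time and is no longer available as of Ruby 3.0, so on a current Ruby the same effect needs a few lines of your own. The sketch below is one way to do it with gsub; the helper name percent_encode and its default pattern are assumptions for illustration, not a library API.

```ruby
# Percent-encode every character matched by `unsafe`, byte by byte,
# so multi-byte characters come out as several %XX groups.
def percent_encode(str, unsafe = /[#&]/)
  str.gsub(unsafe) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

puts percent_encode('one#two&three-four')

# Whitelisting approach: encode everything except '-' and word characters.
puts percent_encode('a b#c', /[^-\w]/)
```

For encoding whole query-string values, the standard library's URI.encode_www_form_component is usually the simpler choice, though it encodes spaces as + rather than %20.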