Prevent JSON pretty_generate from escaping Unicode - ruby

Is there any way to prevent Ruby's JSON.pretty_generate() method from escaping a Unicode character?
I have a JSON object as follows:
my_hash = {"my_str" : "\u0423"};
Running JSON.pretty_generate(my_hash) returns the value as being \\u0423.
Is there any way to prevent this behaviour?

In your question you have a string of 6 unicode characters "\", "u", "0", "4", "2", "3" (my_hash = { "my_str" => '\u0423' }), not a string consisting of 1 "У" character ("\u0423", note double quotes).
According to RFC 4627, section 2.5, a backslash character in a JSON string must be escaped; this is why you get a double backslash from JSON.pretty_generate.
Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as "\\".
char = unescaped /
escape (...
%x5C / ; \ reverse solidus U+005C
escape = %x5C ; \
Thus the Ruby JSON gem escapes this character internally, and there is no way to alter this behavior through options to JSON or JSON.pretty_generate.
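To make the difference between the two literals concrete, here is a small sketch. The variable names are only illustrative, and the last call assumes that a \uXXXX escape in the output would be acceptable; it relies on the json gem's ascii_only generator option.

```ruby
require 'json'

six_chars = '\u0423'  # single quotes: backslash, u, 0, 4, 2, 3
one_char  = "\u0423"  # double quotes: the single character У

puts six_chars.length  # 6
puts one_char.length   # 1

# A literal backslash must be escaped in JSON, so it comes out doubled.
escaped_backslash = JSON.generate({ 'my_str' => six_chars })
puts escaped_backslash  # {"my_str":"\\u0423"}

# A real Unicode character is emitted as-is by default...
raw_utf8 = JSON.generate({ 'my_str' => one_char })
puts raw_utf8  # {"my_str":"У"}

# ...or as a \uXXXX escape when ascii_only is requested.
ascii = JSON.generate({ 'my_str' => one_char }, ascii_only: true)
puts ascii  # {"my_str":"\u0423"}
```

So the doubled backslash only appears when the Ruby string itself contains a backslash.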
If you are interested in the JSON gem's implementation details: it defines an internal mapping hash with an explicit mapping of the '\' character:
module JSON
MAP = {
...
'\\' => '\\\\'
I took this code from the pure-Ruby variant of the JSON gem, gem install json_pure (note that there is also a C extension variant, distributed by gem install json).
Conclusion: if you need to unescape the backslash after JSON generation, you need to implement it in your application logic, as in the code below:
my_hash = { "my_str" => '\u0423' }
# => {"my_str"=>"\\u0423"}
json = JSON.pretty_generate(my_hash)
# => "{\n \"my_str\": \"\\\\u0423\"\n}"
res = json.gsub "\\\\", "\\"
# => "{\n \"my_str\": \"\\u0423\"\n}"
Hope this helps!

Usually, hashes are declared using the rocket => rather than a colon :. There is also an alternative syntax for symbol-keyed hashes since 1.9: my_hash = {my_str: "\u0423"}. In this case, :my_str would be the key.
Anyway, on my computer JSON.pretty_generate works as expected:
irb(main):002:0> my_hash = {"my_str" => "\u0423"}
=> {"my_str"=>"У"}
irb(main):003:0> puts JSON.pretty_generate(my_hash)
{
"my_str": "У"
}
=> nil
Ruby 1.9.2p290, (built-in) json 1.4.2.

Related

How to keep single backslash in Ruby string after to_json formating?

I need to encode a hash containing a URL string. I use the to_json method and I need a backslash in front of each slash (as PHP prints such strings).
For example:
hash = {"url":"http:\\/\\/example.com\\/test"}
hash.to_json
The result is
{:url=>"http:\\/\\/example.com\\/test"}
While I need the following (PHP's json_encode returns the string with a single backslash):
{:url=>"http:\/\/example.com\/test"}
It's very important to keep the string exactly as PHP encodes it, because strings with double and single backslashes give different results.
UPD:
The problem is not in communication. I need to encode my JSON using HMAC (SHA384). And the result is different in PHP and Ruby when I'm using URL strings. If the string doesn't contain backslash all works fine...
The PHP implementation introduces the backslashes. The JSON produced by PHP looks like {"url":"http:\/\/example.com\/test"} while Ruby's JSON is {"url":"http:\\/\\/example.com\\/test"}
My apologies, you do seem to have a valid issue on your hand. The key is this: Why is the slash an escapable character in JSON? and its duplicate target, JSON: why are forward slashes escaped?. Since both unescaped slashes and escaped slashes are allowed, Ruby chose to not escape them, and PHP chose to escape them, and both approaches are correct.
(Aside: there's a bit of a complication in talking about this because \ is an escape character both for a string literal, and for JSON strings. Thus, in this answer, I take care to puts (or echo/print_r) all the values, to see the strings that do not have the string literal backslash escapes, only the backslashes that are actually present in the strings.)
Thus, the JSON {"url":"http:\/\/example.com\/test"} is a representation of the Ruby hash { 'url' => 'http://example.com/test' }, where slashes are escaped (as PHP's json_encode would do it). Ruby's to_json would render that as {"url":"http://example.com/test"}:
# Ruby
json1 = '{"url":"http:\/\/example.com\/test"}'
puts json1 # => {"url":"http:\/\/example.com\/test"}
puts JSON.parse(json1) # => {"url"=>"http://example.com/test"}
puts JSON.parse(json1).to_json # => {"url":"http://example.com/test"}
# PHP
$json1 = '{"url":"http:\/\/example.com\/test"}';
echo $json1; # {"url":"http:\/\/example.com\/test"}
print_r(json_decode($json1)); # stdClass Object
# (
# [url] => http://example.com/test
# )
echo json_encode(json_decode($json1)); # {"url":"http:\/\/example.com\/test"}
On the other hand, {"url":"http:\\/\\/example.com\\/test"} (represented in Ruby and PHP as the string '{"url":"http:\\\\/\\\\/example.com\\\\/test"}') is a representation of the Ruby hash { 'url' => 'http:\/\/example.com\/test' }, where there are actual backslashes, but the slashes are not escaped. PHP's json_encode would render this value as {"url":"http:\\\/\\\/example.com\\\/test"}.
# Ruby
json2 = '{"url":"http:\\\\/\\\\/example.com\\\\/test"}'
puts json2 # => {"url":"http:\\/\\/example.com\\/test"}
puts JSON.parse(json2) # => {"url"=>"http:\\/\\/example.com\\/test"}
puts JSON.parse(json2).to_json # => {"url":"http:\\/\\/example.com\\/test"}
# PHP
$json2 = '{"url":"http:\\\\/\\\\/example.com\\\\/test"}';
echo $json2; # {"url":"http:\\/\\/example.com\\/test"}
print_r(json_decode($json2)); # stdClass Object
# (
# [url] => http:\/\/example.com\/test
# )
echo json_encode(json_decode($json2)); # {"url":"http:\\\/\\\/example.com\\\/test"}
PHP's json_encode has an option to turn off its default of escaping slashes:
# PHP
echo json_encode('/'); # "\/"
echo json_encode('/', JSON_UNESCAPED_SLASHES); # "/"
Ruby does not have a similar option to force escaping of slashes, but since a slash has no special meaning in JSON, we can just manually replace / with \/:
# Ruby
puts '/'.to_json # "/"
puts '/'.to_json.gsub('/', '\/') # "\/"
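Since the original goal was to get matching HMAC (SHA384) signatures, the replacement above can be combined with the signing step. This is only a sketch: php_style_json and the secret are hypothetical names, and it assumes that slash escaping is the only difference between the two encoders for your data (PHP's json_encode also escapes non-ASCII characters by default unless JSON_UNESCAPED_UNICODE is passed, which this does not handle).

```ruby
require 'json'
require 'openssl'

# Hypothetical helper: serialize with Ruby, then escape slashes the way
# PHP's json_encode does by default.
def php_style_json(obj)
  obj.to_json.gsub('/', '\/')
end

payload = { 'url' => 'http://example.com/test' }
body    = php_style_json(payload)
puts body  # {"url":"http:\/\/example.com\/test"}

# Sign the exact byte string that will be sent; the secret is a placeholder.
secret    = 'hypothetical-shared-secret'
signature = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA384'), secret, body)
puts signature.length  # 96
```

Both sides then sign the same bytes, and a JSON parser on either side still recovers the plain URL.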
Use single quotes around strings if you don't want to deal with escaping backslashes.
hash = { url: 'http:\/\/example.com\/test' }
json = hash.to_json
puts json
# => {"url":"http:\\/\\/example.com\\/test"}
Just a quick reminder: in JSON, backslashes need to be escaped because the backslash is itself the escape character.
This way, when PHP parses this JSON document, you will get your string with a single backslash before each slash.
The problem behind your question is probably the real problem. I'm not sure because your question is not totally clear to me so I'm taking a guess/assumption here with my answer.
My assumption here is that you want to communicate between ruby and php, with json.
Well, in that case you don't have to have a problem (with backslashes).
Let Ruby's .to_json (JSON.generate(..)) and JSON.parse(..) handle the Ruby part, let json_encode() and json_decode() handle the PHP part, and you are done.
so in ruby:
- don't use extra escaping-backslashes, let .to_json solve that for you
- use the literal url string you would type in your browser like so:
hash = {"url":"http://example.com/test"} # hash is a ruby object
puts hash.to_json # => {"url":"http://example.com/test"} is JSON (string)
then in php:
var_dump( json_decode('{"url": "http://example.com/test"}') );
gives you:
object(stdClass)#1 (1) {
["url"]=>
string(23) "http://example.com/test"
}
var_dump( json_decode('{"url": "http:\/\/example.com\/test"}') );
gives you:
object(stdClass)#1 (1) {
["url"]=>
string(23) "http://example.com/test"
}
Note that both JSON strings are parsed correctly in PHP and end up as the same normal PHP object.
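The same holds on the Ruby side. A quick check, with illustrative variable names:

```ruby
require 'json'

# Single-quoted Ruby literals keep the backslashes, so the second
# document really contains \/ sequences.
plain   = JSON.parse('{"url": "http://example.com/test"}')
escaped = JSON.parse('{"url": "http:\/\/example.com\/test"}')

puts plain['url']      # http://example.com/test
puts escaped['url']    # http://example.com/test
puts plain == escaped  # true
```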
Try it like below:
hash = {"url":"http:\\/\\/example.com\\/test"}
hash[:url] = hash[:url].delete("\\")
hash.to_json #"{\"url\":\"http://example.com/test\"}"
Hope it helps.

Use ARGV[] argument vector to pass a regular expression in Ruby

I am trying to use gsub or sub on a regex passed through terminal to ARGV[].
Query in terminal: $ruby script.rb input.json "\[\{\"src\"\:\"
Input file first 2 lines:
[{
"src":"http://something.com",
"label":"FOO.jpg","name":"FOO",
"srcName":"FOO.jpg"
}]
[{
"src":"http://something123.com",
"label":"FOO123.jpg",
"name":"FOO123",
"srcName":"FOO123.jpg"
}]
script.rb:
dir = File.dirname(ARGV[0])
output = File.new(dir + "/output_" + Time.now.strftime("%H_%M_%S") + ".json", "w")
open(ARGV[0]).each do |x|
x = x.sub(ARGV[1], '')
output.puts(x) if !x.nil?
end
output.close
This is very basic stuff really, but I am not quite sure on how to do this. I tried:
Regexp.escape with this pattern: [{"src":".
Escaping the characters and not escaping.
Wrapping the pattern between quotes and not wrapping.
Meditate on this:
I wrote a little script containing:
puts ARGV[0].class
puts ARGV[1].class
and saved it to disk, then ran it using:
ruby ~/Desktop/tests/test.rb foo /abc/
which returned:
String
String
The documentation says:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by ‘d’, instead of a digit.
That means that the pattern, though it looks like a regex, isn't one: it's a string, because ARGV can only return strings, because the command line can only contain strings.
When we pass a string into sub, Ruby recognizes it's not a regular expression, so it treats it as a literal string. Here's the difference in action:
'foo'.sub('/o/', '') # => "foo"
'foo'.sub(/o/, '') # => "fo"
The first can't find "/o/" in "foo" so nothing changes. The second can find /o/, though, and returns the result after replacing the first "o".
Another way of looking at it is:
'foo'.match('/o/') # => nil
'foo'.match(/o/) # => #<MatchData "o">
where match finds nothing for the string but can find a hit for /o/.
And all that leads to what's happening in your code. Because sub is being passed a string, it's trying to do a literal match for the regex, and won't be able to find it. You need to change the code to:
sub(Regexp.new(ARGV[1]), '')
but that's not all that has to change. Regexp.new(...) will convert what's passed in into a regular expression, but if you're passing in '/o/' the resulting regular expression will be:
Regexp.new('/o/') # => /\/o\//
which is probably not what you want:
'foo'.match(/\/o\//) # => nil
Instead you want:
Regexp.new('o') # => /o/
'foo'.match(/o/) # => #<MatchData "o">
So, besides changing your code, you'll need to make sure that what you pass in is a valid expression, minus any leading and trailing /.
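Put together, the conversion might look like the sketch below. pattern_from_arg is a hypothetical helper, not something from the question's script; it strips one surrounding /.../ pair if present and otherwise compiles the argument as given.

```ruby
# Hypothetical helper: turn a command-line argument into a Regexp.
# A surrounding /.../ pair, if present, is stripped first.
def pattern_from_arg(arg)
  source = arg[%r{\A/(.*)/\z}m, 1] || arg
  Regexp.new(source)
end

with_slashes    = pattern_from_arg('/o/')
without_slashes = pattern_from_arg('o')
puts 'foo'.sub(with_slashes, '')     # fo
puts 'foo'.sub(without_slashes, '')  # fo

# To match text literally, escape it before compiling.
literal  = Regexp.new(Regexp.escape('[{"src":"'))
stripped = '[{"src":"http://something.com",'.sub(literal, '')
puts stripped  # http://something.com",
```

In a script, the argument would come from ARGV[1] instead of a literal.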
Based on this answer in the thread Convert a string to regular expression ruby, you should use
x = x.sub(/#{ARGV[1]}/,'')
I tested it with this file (test.rb):
puts "You should not see any number [0123456789].".gsub(/#{ARGV[0]}/,'')
I called the file like so:
ruby test.rb "\d+"
# => You should not see any number [].

Replace single quote with backslash single quote

I have a very large string that needs to escape all the single quotes in it, so I can feed it to JavaScript without upsetting it.
I have no control over the external string, so I can't change the source data.
Example:
Cote d'Ivoir -> Cote d\'Ivoir
(the actual string is very long and contains many single quotes)
I'm trying to do this by using gsub on the string, but can't get it to work:
a = "Cote d'Ivoir"
a.gsub("'", "\\\'")
but this gives me:
=> "Cote dIvoirIvoir"
I also tried:
a.gsub("'", 92.chr + 39.chr)
but got the same result; I know it's something to do with regular expressions, but I never get those.
The %q delimiters come in handy here:
# %q(a string) is equivalent to a single-quoted string
puts "Cote d'Ivoir".gsub("'", %q(\\\')) #=> Cote d\'Ivoir
The problem is that \' in a gsub replacement means "part of the string after the match".
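A short sketch of that behaviour follows; the variable names are only illustrative. The first call reproduces the surprising result from the question, and the other two show ways to get a literal backslash.

```ruby
a = "Cote d'Ivoir"

# The replacement is the two characters \' which gsub expands to the
# text after the match ("Ivoir").
post_match = a.gsub("'", "\\'")
puts post_match  # Cote dIvoirIvoir

# Doubling the backslash in the replacement yields a literal backslash.
doubled = a.gsub("'", "\\\\'")
puts doubled  # Cote d\'Ivoir

# A block's return value is inserted verbatim, with no expansion.
from_block = a.gsub("'") { "\\'" }
puts from_block  # Cote d\'Ivoir
```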
You're probably best to use either the block syntax:
a = "Cote d'Ivoir"
a.gsub(/'/) {|s| "\\'"}
# => "Cote d\\'Ivoir"
or the Hash syntax:
a.gsub(/'/, {"'" => "\\'"})
There's also the hacky workaround:
a.gsub(/'/, '\#').gsub(/#/, "'")
# prepare a text file containing [ abcd\'efg ]
require "pathname"
backslashed_text = Pathname("/path/to/the/text/file.txt").readlines.first.strip
# puts backslashed_text => abcd\'efg
unslashed_text = "abcd'efg"
unslashed_text.gsub("'", Regexp.escape(%q|\'|)) == backslashed_text # true
# puts unslashed_text.gsub("'", Regexp.escape(%q|\'|)) => abcd\'efg

Ruby: Replace certain characters in an ascii range with their hex representations

I need to replace certain ASCII characters like # and & with their hex representations for a URL, which would be 23 and 26 respectively.
How can I do this in Ruby? There are also some characters, most notably '-', which do not need to be replaced.
require 'uri'
URI.escape str, /[#&]/
Obviously, you can widen the regex with more characters you want to escape. Or, if you want to do a whitelisting approach, you can do, say,
URI.escape str, /[^-\w]/
This is ruby, so there's a mandatory 20 different ways to do it. Here's mine:
>> a = 'one&two%three'
=> "one&two%three"
>> a.gsub(/[&%]/, '&' => '&'.ord, '%' => '%'.ord)
=> "one38two37three"
I'm pretty sure Ruby has this functionality built in for URLs. However, if you want to define some more general translation facility you may use code like the following:
s = "h#llo world"
t = { " " => "%20", "#" => "%23" }
puts s.split(//).map { |c| t[c] || c }.join
Which would output
h%23llo%20world
In the above code, t is a hash defining the mapping from specific characters to their representation. The string is broken into characters and the hash is searched for each character's equivalent.
More generically and easily:
require 'uri'
URI.escape(your_string, Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
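URI.escape was deprecated and then removed in Ruby 3.0, so on newer Rubies the same idea can be written with a plain gsub. This is a sketch; percent_encode and its default character class are hypothetical, not a library API.

```ruby
# Hypothetical helper: percent-encode every character matched by `unsafe`.
# Each byte of a matched character becomes %XX, with XX in uppercase hex.
def percent_encode(str, unsafe = /[#&]/)
  str.gsub(unsafe) { |ch| ch.bytes.map { |b| format('%%%02X', b) }.join }
end

selected = percent_encode('one#two&three-four')
puts selected  # one%23two%26three-four

# Whitelisting approach: encode everything except word characters and '-'.
whitelisted = percent_encode('a b#c-d', /[^-\w]/)
puts whitelisted  # a%20b%23c-d
```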

How to extract a single character (as a string) from a larger string in Ruby?

What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?
In Ruby 1.9, it's easy. In Ruby 1.9, Strings are encoding-aware sequences of characters, so you can just index into it and you will get a single-character string out of it:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.
Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.
'abc'[1..1] # => "b"
'abc'[1].chr # => "b"
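For Ruby 1.9 and later, the options above can be compared side by side. This is a small sketch assuming a UTF-8 string; the escape \u00B5 is used only so that the example does not depend on the source file's encoding.

```ruby
s = "\u00B5sec"  # "µsec"

first_char = s[0]
puts first_char     # µ
puts s[0, 1]        # µ
puts s[0..0]        # µ
puts s.chars.first  # µ

# The underlying bytes are still available when needed.
puts s.bytes.first  # 194
puts s.bytesize     # 5
```

All four character-based forms return a one-character String, while bytes exposes the raw UTF-8 encoding that Ruby 1.8 indexing would have returned.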

Resources