Serialization delimiter in Ruby - ruby

I'm writing a serialize method that converts a Tree into a string for storage. I was looking for a delimiter to use in the serialization and wasn't sure what to use.
I can't use , because that might exist as a data value in a node. e.g.
A
/ \
B ,
would serialize to A, B, ,, and break my deserialization method. Can I use non-printable ASCII characters, or should I just guess what character(s) are unlikely to show up as input and use those as my delimiters?
Here's what my serialize method looks like, if you're curious:
def serialize(root)
if root.nil?
""
else
root.val + DELIMITER +
serialize(root.left) + DELIMITER +
serialize(root.right)
end
end

There are several common methods I can think of:
Escaping: you define an escape symbol that "escapes" from the "special" interpretation. Think about how \ acts as an escape character in Ruby string literals.
Fixed Fields / Length Encoding: you know in advance where a field begins and ends. (Fixed fields are basically a special-case of length encoding where you can leave out the length because it is always the same.)
Example for escaping:
def serialize(root)
if root.nil?
""
else
"#{escape(root.val)},#{serialize(root.left)},#{serialize(root.right)}" # using ,
end
end
private def escape(str) str.gsub('\', '\\').gsub(',', '\,') end
Example for length encoding:
def serialize(root)
if root.nil?
"0,"
else
"#{root.val.size},#{root.val}#{serialize(root.left)}#{serialize(root.right)}" # using length encoding
end
end
Any , you find within size characters belongs to the value. Fixed fields would basically just concatenate the values and assume that they are all the same fixed length.
You might want to look at how existing serialization formats handle it, like OGDL: Ordered Graph Data Language, YAML: YAML Ain't Markup Language, JSON, CSV (Character-Separated Values), XML (eXtensible Markup Language).
If you want to look at binary formats, you can check out Ruby's Marshal format or ASN.1.
Your idea of finding a seldom-used character is good, even if you use escaping, you will still need less escaping with a less used character. Just imaginee what it would look likee if 'ee' was the eescapee characteer. However, I think using a non-printable character goes too far: unless you specifically want to design a binary format (such as Ruby's Marshal, Python's Pickle, or Java's Serialization), "less debuggability" (i.e. debugging by simply inspecting the output with less) is a nice property to have and one that you should not give up easily.

Related

In ruby, how do I turn a text representation of a byte in to a byte?

What is the best way to turn the string "FA" into /xFA/ ?
To be clear, I don't want to turn "FA" into 7065 or "FA".to_i(16).
In Java the equivalent would be this:
byte b = (byte) Integer.decode("0xFA");
So you're using / markers, but you aren't actually asking about regexps, right?
I think this does what you want:
['FA'].pack('H*')
# => "\xFA"
There is no actual byte type in ruby stdlib (I don't think? unless there's one I don't know about?), just Strings, that can be any number of bytes long (in this case, one). A single "byte" is typically represented as a 1-byte long String in ruby. #bytesize on a String will always return the length in bytes.
"\xFA".bytesize
# => 1
Your example happens not to be a valid UTF-8 character, by itself. Depending on exactly what you're doing and how you're environment is set up, your string might end up being tagged with a UTF-8 encoding by default. If you are dealing with binary data, and want to make sure the string is tagged as such, you might want to #force_encoding on it to be sure. It should NOT be neccesary when using #pack, the results should be tagged as ASCII-8BIT already (which has a synonym of BINARY, it's basically the "null encoding" used in ruby for binary data).
['FA'].pack('H*').encoding
=> #<Encoding:ASCII-8BIT
But if you're dealing with string objects holding what's meant to be binary data, not neccesarily valid character data in any encoding, it is useful to know you may sometimes need to do str.force_encoding("ASCII-8BIT") (or force_encoding("BINARY"), same thing), to make sure your string isn't tagged as a particular text encoding, which would make ruby complain when you try to do certain operations on it if it includes invalid bytes for that encoding -- or in other cases, possibly do the wrong thing
Actually for a regexp
Okay, you actually do want a regexp. So we have to take our string we created, and embed it in a regexp. Here's one way:
representation = "FA"
str = [representation].pack("H*")
# => "\xFA"
data = "\x01\xFA\xC2".force_encoding("BINARY")
regexp = Regexp.new(str)
data =~ regexp
# => 1 (matched on byte 1; the first byte of data is byte 0)
You see how I needed the force_encoding there on the data string, otherwise ruby would default to it being a UTF-8 string (depending on ruby version and environment setup), and complain that those bytes aren't valid UTF-8.
In some cases you might need to explicitly set the regexp to handle binary data too, the docs say you can pass a second argument 'n' to Regexp.new to do that, but I've never done it.

Ruby Regular Expression lookahead to Split at pipe unless contained in brackets

I'm trying to decode the following string:
body = '{type:paragaph|class:red|content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]}'
body << '{type:image|class:grid|content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]}'
I need the string to split at the pipes but not where a pipe is contained with square brackets, to do this I think I need to perform a lookahead as described here: How to split string by ',' unless ',' is within brackets using Regex?
My attempt(still splits at every pipe):
x = self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/ *\|(?!\]) */)}
->
[
["type:paragaph", "class:red", "content:[class:intro", "body:This is the introduction paragraph.][body:This is the second paragraph.]"]
["type:image", "class:grid", "content:[id:1", "title:image1][id:2", "title:image2][id:3", "title:image3]"]
]
Expecting:
->
[
["type:paragaph", "class:red", "content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]"]
["type:image", "class:grid", "content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]"]
]
Does anyone know the regex required here?
Is it possible to match this regex? I can't seem to modify it correctly Regular Expression to match underscores not surrounded by brackets?
I modified the answer here Split string in Ruby, ignoring contents of parentheses? to get:
self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/\|\s*(?=[^\[\]]*(?:\[|$))/)}
Seems to do the trick. Though I'm sure if there's any shortfalls.
Dealing with nested structures that have identical syntax is going to make things difficult for you.
You could try a recursive descent parser (a quick Google turned up https://github.com/Ragmaanir/grammy - not sure if any good)
Personally, I'd go for something really hacky - some gsubs that convert your string into JSON, then parse with a JSON parser :-). That's not particularly easy either, though, but here goes:
require 'json'
b1 = body.gsub(/([^\[\|\]\:\}\{]+)/,'"\1"').gsub(':[',':[{').gsub('][','},{').gsub(']','}]').gsub('}{','},{').gsub('|',',')
JSON.parse('[' + b1 + ']')
It wasn't easy because the string format apparently uses [foo:bar][baz:bam] to represent an array of hashes. If you have a chance to modify the serialised format to make it easier, I would take it.
I modified the answer here Split string in Ruby, ignoring contents of parentheses? to get:
self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/\|\s*(?=[^\[\]]*(?:\[|$))/)}
Seems to do the trick. If it has any shortfalls please suggest something better.

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443". ̆
encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
You can directly use unicode characters if you just add #Encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.
try this gem. It converts Unicode or non-ASCII punctuation and symbols to nearest ASCII punctuation and symbols
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

How to get a Ruby substring of a Unicode string?

I have a field in my Rails model that has max length 255.
I'm importing data into it, and some times the imported data has a length > 255. I'm willing to simply chop it off so that I end up with the largest possible valid string that fits.
I originally tried to do field[0,255] in order to get this, but this will actually chop trailing Unicode right through a character. When I then go to save this into the database, it throws an error telling me I have an invalid character due to the character that's been halved or quartered.
What's the recommended way to chop off Unicode characters to get them to fit in my space, without chopping up individual characters?
Uh. Seems like truncate and friends like to play with chars, but not their little cousins bytes. Here's a quick answer for your problem, but I don't know if there's a more straighforward and elegant question I mean answer
def truncate_bytes(string, size)
count = 0
string.chars.take_while{|c| (a += c.bytes.to_a.length) <= size }.join
end
Give a look at the Chars class of ActiveSupport.
Use the multibyte proxy method (mb_chars) before manipulating the string:
str.mb_chars[0,255]
See http://api.rubyonrails.org/classes/String.html#method-i-mb_chars.
Note that until Rails 2.1 the method was "chars".

Ruby -- looking for some sort of "Regexp unescape" method

I have a bunch of string with special escape codes that I want to store unescaped- eg, the interpreter shows
"\\014\"\\000\"\\016smoothing\"\\011mean\"\\022color\"\\011zero#\\016"
but I want it to show (when inspected) as
"\014\"\000\"\016smoothing\"\011mean\"\022color\"\011zero#\016"
What's the method to unescape them? I imagine that I could make a regex to remove 1 backslash from every consecutive n backslashes, but I don't have a lot of regex experience and it seems there ought to be a "more elegant" way to do it.
For example, when I puts MyString it displays the output I'd like, but I don't know how I might capture that into a variable.
Thanks!
Edited to add context: I have this class that is being used to marshal / restore some stuff, but when I restore some old strings it spits out a type error which I've determined is because they weren't -- for some inexplicable reason -- stored as base64. They instead appear to have just been escaped, which I don't want, because trying to restore them similarly gives the TypeError
TypeError: incompatible marshal file format (can't be read)
format version 4.8 required; 92.48 given
because Marshal looks at the first characters of the string to determine the format.
require 'base64'
class MarshaledStuff < ActiveRecord::Base
validates_presence_of :marshaled_obj
def contents
obj = self.marshaled_obj
return Marshal.restore(Base64.decode64(obj))
end
def contents=(newcontents)
self.marshaled_obj = Base64.encode64(Marshal.dump(newcontents))
end
end
Edit 2: Changed wording -- I was thinking they were "double-escaped" but it was only single-escaped. Whoops!
If your strings give you the correct output when you print them then they are already escaped correctly. The extra backslashes you see are probably because you are displaying them in the interactive interpreter which adds extra backslashes for you when you display variables to make them less ambiguous.
> x
=> "\\"
> puts x
\
=> nil
> x.length
=> 1
Note that even though it looks like x contains two backslashes, the length of the string is one. The extra backslash is added by the interpreter and is not really part of the string.
If you still think there's a problem, please be more specific about how you are displaying the strings that you mentioned in your question.
Edit: In your example the only thing that need unescaping are octal escape codes. You could try this:
x = x.gsub(/\\[0-2][0-7]{2}/){ |c| c[1,3].to_i(8).chr }

Resources