In Ruby, how do I turn a text representation of a byte into a byte?

What is the best way to turn the string "FA" into /xFA/ ?
To be clear, I don't want to turn "FA" into 7065 or "FA".to_i(16).
In Java the equivalent would be this:
byte b = (byte) Integer.decode("0xFA");

So you're using / markers, but you aren't actually asking about regexps, right?
I think this does what you want:
['FA'].pack('H*')
# => "\xFA"
There is no actual byte type in the Ruby stdlib (I don't think? unless there's one I don't know about?), just Strings, which can be any number of bytes long (in this case, one). A single "byte" is typically represented as a 1-byte-long String in Ruby. #bytesize on a String will always return the length in bytes.
"\xFA".bytesize
# => 1
Your example happens not to be a valid UTF-8 character by itself. Depending on exactly what you're doing and how your environment is set up, your string might end up being tagged with a UTF-8 encoding by default. If you are dealing with binary data and want to make sure the string is tagged as such, you might want to call #force_encoding on it to be sure. It should NOT be necessary when using #pack; the result should already be tagged as ASCII-8BIT (which has the synonym BINARY; it's basically the "null encoding" used in Ruby for binary data).
['FA'].pack('H*').encoding
# => #<Encoding:ASCII-8BIT>
But if you're dealing with String objects holding what's meant to be binary data, not necessarily valid character data in any encoding, it is useful to know you may sometimes need to do str.force_encoding("ASCII-8BIT") (or force_encoding("BINARY"), same thing) to make sure your string isn't tagged as a particular text encoding. Otherwise Ruby will complain when you try certain operations on it if it includes invalid bytes for that encoding -- or in other cases, possibly do the wrong thing.
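A small illustration of that (the byte values are arbitrary, and the UTF-8 default assumes a typical UTF-8 environment):
data = "\x01\xFA\xC2"
data.encoding          # => #<Encoding:UTF-8> (in a UTF-8 source file)
data.valid_encoding?   # => false
data.force_encoding("ASCII-8BIT")
data.encoding          # => #<Encoding:ASCII-8BIT>
data.bytes             # => [1, 250, 194]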
Actually for a regexp
Okay, you actually do want a regexp. So we have to take the string we created and embed it in a regexp. Here's one way:
representation = "FA"
str = [representation].pack("H*")
# => "\xFA"
data = "\x01\xFA\xC2".force_encoding("BINARY")
regexp = Regexp.new(str)
data =~ regexp
# => 1 (matched on byte 1; the first byte of data is byte 0)
You see how I needed the force_encoding there on the data string; otherwise Ruby would default to treating it as a UTF-8 string (depending on Ruby version and environment setup) and complain that those bytes aren't valid UTF-8.
In some cases you might need to explicitly set the regexp to handle binary data too; the docs say you can pass a second argument 'n' to Regexp.new to do that, but I've never done it.
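For what it's worth, here's an untested sketch of that: Regexp::NOENCODING should be the constant form of the /n ("no encoding") option, and I've added Regexp.escape in case the byte happens to be a regexp metacharacter:
str    = ["FA"].pack("H*")
# NOENCODING is assumed to be the /n option mentioned above
regexp = Regexp.new(Regexp.escape(str), Regexp::NOENCODING)
"\x01\xFA\xC2".force_encoding("BINARY") =~ regexp
# => 1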

Related

JavaScript alternative to Ruby's force_encoding(Encoding::ASCII_8BIT)

I'm making my way through the Building Git book, but attempting to build my implementation in JavaScript. I'm stumped at the part of reading data in this file format that apparently only Ruby uses. Here is the excerpt from the book about this:
Note that we set the string’s encoding to ASCII_8BIT, which is Ruby’s way of saying that the string represents arbitrary binary data rather than text per se. Although the blobs we’ll be storing will all be ASCII-compatible source code, Git does allow blobs to be any kind of file, and certainly other kinds of objects — especially trees — will contain non-textual data. Setting the encoding this way means we don’t get surprising errors when the string is concatenated with others; Ruby sees that it’s binary data and just concatenates the bytes and won’t try to perform any character conversions.
Is there a way to emulate this encoding in JS?
or
Is there an alternative encoding I can use that JS and Ruby share that won't break anything?
Additionally, I've tried using Buffer.from(< text input >, 'binary') but it doesn't result in the same amount of bytes that the ruby ASCII-8BIT returns because in Node.js binary maps to ISO-8859-1.
Node certainly supports binary data, that's kind of what Buffer is for. However, it is crucial to know what you are converting into what. For example, the emoji "☺️" is encoded as six bytes in UTF-8:
// UTF-16 (JS) string to UTF-8 representation
Buffer.from('☺️', 'utf-8')
// => <Buffer e2 98 ba ef b8 8f>
If you happen to have a string that is not a native JS string (i.e. its encoding is different), you can use the encoding parameter to make Buffer interpret each character in a different manner (though only a few such conversions are supported). For example, if we have a string of six characters that correspond to the six numbers above, it is not a smiley face for JavaScript, but Buffer.from can help us repackage it:
Buffer.from('\u00e2\u0098\u00ba\u00ef\u00b8\u008f', 'binary')
// => <Buffer e2 98 ba ef b8 8f>
JavaScript itself has only one encoding for its strings; thus, the parameter 'binary' is not really a binary encoding, but a mode of operation for Buffer.from, telling it that the string would have been a binary string if each character were one byte (however, since JavaScript internally uses UCS-2, each character is always represented by two bytes). Thus, if you use it on something that is not a string of characters in the range U+0000 to U+00FF, it will not do the correct thing, because there is no correct thing to do (GIGO principle). What it will actually do is take the lower byte of each character, which is probably not what you want:
Buffer.from('STUFF', 'binary') // 8BIT range: U+0000 to U+00FF
// => <Buffer 53 54 55 46 46> ("STUFF")
Buffer.from('ＳＴＵＦＦ', 'binary') // U+FF33 U+FF34 U+FF35 U+FF26 U+FF26
// => <Buffer 33 34 35 26 26> (garbage)
So, Node's Buffer structure exactly corresponds to Ruby's ASCII-8BIT "encoding" (binary is an encoding like "bald" is a hair style — it simply means no interpretation is attached to bytes; e.g. in ASCII, 65 means "A"; but in binary "encoding", 65 is just 65). Buffer.from with 'binary' lets you convert weird strings where one character corresponds to one byte into a Buffer. It is not the normal way of handling binary data; its function is to un-mess-up binary data when it has been read incorrectly into a string.
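For comparison, here is the Ruby side of that correspondence, using the same six bytes as the emoji above (just an illustration):
bytes = "\xE2\x98\xBA\xEF\xB8\x8F".force_encoding(Encoding::ASCII_8BIT)
bytes.encoding   # => #<Encoding:ASCII-8BIT>
bytes.bytes      # => [226, 152, 186, 239, 184, 143]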
I assume you are reading a file as a string, then trying to convert it to a Buffer — but your string is not actually in what Node considers to be the "binary" form (a sequence of characters in the range U+0000 to U+00FF; thus "in Node.js binary maps to ISO-8859-1" is not really true, because ISO-8859-1 is a sequence of characters in the range 0x00 to 0xFF — a single-byte encoding!).
Ideally, to have a binary representation of file contents, you would want to read the file as a Buffer in the first place (by using fs.readFile without an encoding), without ever touching a string.
(If my guess here is incorrect, please specify what the contents of your < text input > is, and how you obtain it, and in which case "it doesn't result in the same amount of bytes".)
EDIT: I seem to like typing Array.from too much. It's Buffer.from, of course.

Serialization delimiter in Ruby

I'm writing a serialize method that converts a Tree into a string for storage. I was looking for a delimiter to use in the serialization and wasn't sure what to use.
I can't use , because that might exist as a data value in a node. e.g.
  A
 / \
B   ,
would serialize to A, B, ,, and break my deserialization method. Can I use non-printable ASCII characters, or should I just guess what character(s) are unlikely to show up as input and use those as my delimiters?
Here's what my serialize method looks like, if you're curious:
def serialize(root)
  if root.nil?
    ""
  else
    root.val + DELIMITER +
      serialize(root.left) + DELIMITER +
      serialize(root.right)
  end
end
There are several common methods I can think of:
Escaping: you define an escape symbol that "escapes" from the "special" interpretation. Think about how \ acts as an escape character in Ruby string literals.
Fixed Fields / Length Encoding: you know in advance where a field begins and ends. (Fixed fields are basically a special-case of length encoding where you can leave out the length because it is always the same.)
Example for escaping:
def serialize(root)
  if root.nil?
    ""
  else
    "#{escape(root.val)},#{serialize(root.left)},#{serialize(root.right)}" # using ,
  end
end

private def escape(str)
  # double every backslash first, then escape the delimiter itself
  str.gsub('\\') { '\\\\' }.gsub(',') { '\,' }
end
Example for length encoding:
def serialize(root)
  if root.nil?
    "0,"
  else
    "#{root.val.size},#{root.val}#{serialize(root.left)}#{serialize(root.right)}" # using length encoding
  end
end
Any , you find within size characters belongs to the value. Fixed fields would basically just concatenate the values and assume that they are all the same fixed length.
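For completeness, here's a rough sketch of a matching deserializer for that length-encoded format (Node here is a stand-in Struct, not something from your code, and note that this scheme can't tell a nil subtree from a node whose value is the empty string, since both come out as "0,"):
Node = Struct.new(:val, :left, :right)

def deserialize(str, pos = 0)
  comma = str.index(',', pos)
  size  = str[pos...comma].to_i
  return [nil, comma + 1] if size.zero?   # "0," encodes a nil subtree here

  val = str[comma + 1, size]
  left,  pos = deserialize(str, comma + 1 + size)
  right, pos = deserialize(str, pos)
  [Node.new(val, left, right), pos]
end
With that in place, deserialize(serialize(root)).first should hand you back an equivalent tree.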
You might want to look at how existing serialization formats handle it, like OGDL: Ordered Graph Data Language, YAML: YAML Ain't Markup Language, JSON, CSV (Character-Separated Values), XML (eXtensible Markup Language).
If you want to look at binary formats, you can check out Ruby's Marshal format or ASN.1.
Your idea of finding a seldom-used character is good: even if you use escaping, you will still need less escaping with a rarely used character. Just imaginee what it would look likee if 'ee' was the eescapee characteer. However, I think using a non-printable character goes too far: unless you specifically want to design a binary format (such as Ruby's Marshal, Python's Pickle, or Java's Serialization), "less debuggability" (i.e. being able to debug by simply inspecting the output with less) is a nice property to have and one that you should not give up easily.

string to integer conversion omitting first character of charset

This is more of a general problem than a ruby specific one, I just happen to be doing it in ruby. I am trying to convert a string into an Integer/Long/Bigint, or whatever you want to call it, using a charset for example Base62 (0-9a-zA-Z).
Problem is, when I try to convert a string like "0ab" into an integer and that integer back to a string, I get "ab". This occurs with any string starting with the first character of the alphabet.
Here is an example implementation, that has the same issue.
https://github.com/jtzemp/base62/blob/master/lib/base62.rb
In action:
2.1.3 :001 > require 'base62'
=> true
2.1.3 :002 > Base62.decode "0ab"
=> 2269
2.1.3 :003 > Base62.encode 2269
=> "ab"
I might be missing the obvious.
How can I convert bidirectionally without that exception?
You're correct that this is a more general problem.
One solution is to use "padding", which adds extra information to indicate things like missing bits or a conversion that isn't perfectly clean.
In your particular code for example, you are currently losing the leading character if it's the first primitive. This is because the leading character has a zero index, and you're adding the zero to your int, which doesn't change anything.
In your code, the padding could be accomplished a variety of ways.
For example, prepending a given leading character that is not the first primitive.
Essentially, you need to choose a way to protect the zero value, so it is not lost when the value is stored as an int.
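Here's a rough sketch of that padding idea (a hand-rolled codec purely for illustration, not the base62 gem's API; it uses the 0-9a-zA-Z order from your question, so the integers differ from the gem's output, but the leading-zero loss is the same):
CHARSET = [*'0'..'9', *'a'..'z', *'A'..'Z']

def decode62(str)
  str.chars.reduce(0) { |acc, c| acc * 62 + CHARSET.index(c) }
end

def encode62(int)
  return CHARSET.first if int.zero?
  out = ''
  while int > 0
    out.prepend(CHARSET[int % 62])
    int /= 62
  end
  out
end

decode62('0ab')                         # => 631 (the leading "0" contributes nothing)
encode62(decode62('0ab'))               # => "ab" -- the "0" is lost
encode62(decode62('1' + '0ab'))[1..-1]  # => "0ab" -- padded with a sentinel "1"
Any character other than the first primitive works as the sentinel; the point is simply that the most significant digit of the integer can never be zero.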
An alternate solution is to change your storage from using an int to using any other object that doesn't lose leading zeros, such as a string. This is how a typical Base64 encoding class does it: the input is a string, and the storage is also a string.

Create a SHA1 hash in Ruby via .net technique

OK hopefully the title didn't scare you away. I'm creating a sha1 hash using Ruby but it has to follow a formula that our other system uses to create the hash.
How can I do the following via Ruby? I'm creating hashes fine, but the format stuff is confusing me; curious if there's something equivalent in the Ruby standard library.
System Security Cryptography (MSDN)
Here's the C# code that I'm trying to convert to Ruby. I'm making my hash fine, but not sure about the 'String.Format("{0,2:X2}' part.
// create our SHA1 provider
SHA1 sha = new SHA1CryptoServiceProvider();
String str = "Some value to hash";
String hashedValue = ""; // hashed value of the string

// hash the data -- use ascii encoding
byte[] hashedDataBytes = sha.ComputeHash(Encoding.ASCII.GetBytes(str));

// loop through each byte in the byte array
foreach (byte b in hashedDataBytes)
{
    // convert each byte -- append to result
    hashedValue += String.Format("{0,2:X2}", b);
}
A SHA1 hash of a specific piece of data is always the same hash (effectively just a large number); the only variation should be how you need to format it, e.g. to send to the other system. Although particularly obscure systems might post-process the data, truncate it, or only take alternate bytes etc.
At a very rough guess from reading the C# code, this ends up as a standard-looking 40-character hex string. So my initial thought in Ruby is:
require 'digest/sha1'
Digest::SHA1.hexdigest("Some value to hash").upcase
. . . I don't know, however, what the forced ASCII conversion in C# would do when it starts with e.g. a Latin-1 or UTF-8 string. Those would be useful example inputs, if only to see C# throw an exception -- you know then whether you need to worry about character encoding for your Ruby version. The data input to SHA1 is bytes, not characters, so which encoding to use and how to convert are parts of your problem.
My current understanding is that Encoding.ASCII.GetBytes will force anything over character number 127 to a '?', so you may need to emulate that in Ruby using a .gsub or similar, especially if you are actually expecting that to come in from the data source.
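If you do need to reproduce that, a rough sketch in Ruby might look like the following (assuming the .NET side really does turn non-ASCII characters into '?'; dotnet_style_sha1 is just a name I made up):
require 'digest/sha1'

# the '?' replacement mirrors what Encoding.ASCII.GetBytes is believed to
# do with non-ASCII input -- verify against the other system's output
def dotnet_style_sha1(str)
  ascii = str.encode('ASCII', invalid: :replace, undef: :replace, replace: '?')
  Digest::SHA1.hexdigest(ascii).upcase
end

dotnet_style_sha1("Some value to hash")
# => a 40-character uppercase hex string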

ruby: representing binary string as Bignum

I need to have a numeric representation of a binary string of arbitrary length. This seemingly trivial task unexpectedly turned out to be complex. The best I could come up with so far is
string.unpack('H*')[0].to_i(16)
but this operation lacks reversibility, because unpack may return a leading zero in the highest nibble:
['ABC'].pack('H*') == ['0ABC'].pack('H*') # false
Now I need to check if I got even number of nibbles after converting from integer, pad with zero if needed etc. It's all good and clear, but I just can't believe this must be so convoluted.
Update with example:
s = "\x01\x1D\x9A".force_encoding 'binary' # "\x01\x1D\x9A"
s.unpack('H*') # ["011d9a"]
s.unpack('H*')[0].to_i(16) # 73114
Now let's decode:
s.unpack('H*')[0].to_i(16).to_s(16) # "11d9a" — notice that leading zero is gone
[s.unpack('H*')[0].to_i(16).to_s(16)].pack('H*') # "\x11\xD9\xA0"
[s.unpack('H*')[0].to_i(16).to_s(16)].pack('H*') == s # false, obviously
In other words, we failed to decode to the same value we started with.
Unfortunately, I have zero knowledge of Ruby, but I have an idea which you could check...
Maybe the .to_s(16) method converts to a string which has a different default encoding, and maybe that matters when you make the string comparison after .pack('H*')?
Maybe this will work:
[s.unpack('H*')[0].to_i(16).to_s(16)].pack('H*').force_encoding('binary') == s
Otherwise it's difficult to imagine how a nibble with a leading zero could convert to a string byte which is different from the same string byte converted from the hex nibble without the zero.
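For reference, the padding mentioned in the question can be written out like this (just a sketch; the helper names are made up, and it assumes you know, or can carry around, the original byte length):
def str_to_bignum(str)
  str.unpack('H*')[0].to_i(16)
end

def bignum_to_str(num, byte_length)
  # left-pad the hex digits so leading zero nibbles are restored
  [num.to_s(16).rjust(byte_length * 2, '0')].pack('H*')
end

s = "\x01\x1D\x9A".force_encoding 'binary'
n = str_to_bignum(s)               # => 73114
bignum_to_str(n, s.bytesize) == s  # => true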
