Ruby equivalent of ReadString?

I'm working on a project with a "customer-made" database. The customer developed a C++/CLI application that stores and retrieves his data from a binary file using the BinaryWriter.Write(String) and BinaryReader.ReadString() methods.
I'm no C++/CLI expert, but from what I understand these methods use a 7-bit encoding of the first byte(s) to determine the string's length.
I need to access his data from a Rails application. Does anyone have an idea of how to do the same thing in Ruby?

If you're dealing with raw binary data, you'll probably need to spend some time familiarizing yourself with the pack and unpack methods and their various options. Maybe what you're describing is a "Pascal string" where the length is encoded up front, or a variation on that.
For example:
length = data.unpack("C")[0]
string = data.unpack("Ca#{length}")[1]  # [1]: element 0 is the length byte itself
The double-unpack is required because you don't know the length of the string to unpack until you do the first step. You could probably do this using a substring as well, like data[1,length] if you're reasonably certain you're not dealing with UTF-8 data.
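Note that .NET's BinaryWriter actually writes the length as a 7-bit encoded integer (a varint: seven payload bits per byte, with the high bit set while more bytes follow), so a single "C" byte only covers strings shorter than 128 bytes. Here is a minimal Ruby sketch of reading the full prefix from an IO; the file name data.bin is just a placeholder:
# Read a .NET BinaryWriter-style length-prefixed string from an IO.
# The prefix is a varint: 7 bits per byte, high bit means "more follows".
# The string bytes themselves are UTF-8 unless another encoding was
# passed to the BinaryWriter.
def read_dotnet_string(io)
  length = 0
  shift = 0
  loop do
    byte = io.readbyte
    length |= (byte & 0x7F) << shift
    break if (byte & 0x80).zero?
    shift += 7
  end
  io.read(length)
end

File.open("data.bin", "rb") { |f| puts read_dotnet_string(f) }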

Related

PHP's pack/unpack in Go

I'm intending to rewrite a game server I already have working in PHP-cli.
As I cannot touch the client, I have to use the on-wire protocol as-is. This is a pure binary format, with multiple fields per packet. I use a modified version of PHP's pack/unpack commands to convert to and from this. A packet is generally in the form:
header:
unpack('nitem/ncmdNum/NdataLen', $buf);
data:
any of several dozen subsequent unpack strings, as identified by cmdNum. e.g.:
a32first_name/a32second_name/a32third_name/nfirst_itemid/nfirst_flags/nfirst_level/nunused/Nfirst_perm_flags/Ncvsize/a{$cvsize}contentsVector
nsuccess/ndummyfielda/a*xerror_msg/a*xtext/NcurrentVersion/NtimeLimit/a*xlimitValue/a*xlimit
[Where a{$cvsize} means a fixed-length string whose length is given by the named (usually immediately preceding) value, and a* means a zero-terminated variable-length string.]
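Aside for Ruby readers of this page: Ruby's String#unpack shares PHP's directive letters here (n = 16-bit big-endian, N = 32-bit big-endian, a = string), so a hypothetical equivalent of the header unpack is nearly identical:
item, cmd_num, data_len = buf.unpack("nnN")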
The current PHP implementation unpacks this and calls the function that deals with command 'cmdNum', passing an associative array containing the unpacked data. That function in turn calls the sending code with a similar array for the return values.
Whilst I'm sure I could map these to structures, reading from the input (and writing to it) wouldn't simply be a matter of dropping the buffer over the struct. Plus, most packet types are only used once, in a function dedicated to dealing with that message, so coding up several dozen structures, and the code to deal with loading each field individually, seems like a lot of work.
Is there any method or package that I can use as the basis for dealing with this sort of thing? My searching for "php unpack in go" only seems to return results based on people unpacking a single numerical value, which is obviously easy enough to replace with encoding/binary!
The unpack/pack strings are auto-generated by some other PHP based on a specification grabbed from the client, so I could change that to create a different format fairly easily, if there is something I can use. I'd normally have no issues with writing my own functions to do this sort of thing, but being totally new to Go, this might be too much off-the-bat.

Hash payloads like form data or json in ruby [duplicate]

The following question is more complex than it may first seem.
Assume that I've got an arbitrary JSON object, one that may contain any amount of data including other nested JSON objects. What I want is a cryptographic hash/digest of the JSON data, without regard to the actual JSON formatting itself (eg: ignoring newlines and spacing differences between the JSON tokens).
The last part is a requirement, as the JSON will be generated/read by a variety of (de)serializers on a number of different platforms. I know of at least one JSON library for Java that completely removes formatting when reading data during deserialization. As such it will break the hash.
The arbitrary-data clause above also complicates things, as it prevents me from taking known fields in a given order and concatenating them prior to hashing (think roughly of how Java's non-cryptographic hashCode() method works).
Lastly, hashing the entire JSON String as a chunk of bytes (prior to deserialization) is not desirable either, since there are fields in the JSON that should be ignored when computing the hash.
I'm not sure there is a good solution to this problem, but I welcome any approaches or thoughts =)
The problem is a common one when computing hashes for any data format where flexibility is allowed. To solve this, you need to canonicalize the representation.
For example, the OAuth1.0a protocol, which is used by Twitter and other services for authentication, requires a secure hash of the request message. To compute the hash, OAuth1.0a says you need to first alphabetize the fields, separate them by newlines, remove the field names (which are well known), and use blank lines for empty values. The signature or hash is computed on the result of that canonicalization.
XML DSIG works the same way - you need to canonicalize the XML before signing it. There is a proposed W3 standard covering this, because it's such a fundamental requirement for signing. Some people call it c14n.
I don't know of a canonicalization standard for JSON; it's worth researching.
If there isn't one, you can certainly establish a convention for your particular application usage. A reasonable start might be (a Ruby sketch follows the list):
lexicographically sort the properties by name
double quotes used on all names
double quotes used on all string values
no space, or one space, between names and the colon, and between the colon and the value
no spaces between values and the following comma
all other whitespace collapsed to either a single space or nothing (choose one)
exclude any properties you don't want to sign (one example is, the property that holds the signature itself)
sign the result, with your chosen algorithm
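Here is a minimal Ruby sketch of that convention (one possible rule set, not a standard; the key is a stand-in, and the excluded property name nichols-hmac is borrowed from the suggestion below):
require "json"
require "openssl"

# Recursively sort object keys; JSON.generate then emits double-quoted
# names and strings with no whitespace, covering several rules at once.
def canonicalize(value)
  case value
  when Hash  then value.sort.map { |k, v| [k.to_s, canonicalize(v)] }.to_h
  when Array then value.map { |v| canonicalize(v) }
  else value
  end
end

def sign(json_text, key)
  data = JSON.parse(json_text)
  data.delete("nichols-hmac")   # exclude the signature property itself
  OpenSSL::HMAC.hexdigest("SHA256", key, JSON.generate(canonicalize(data)))
end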
You may also want to think about how to pass that signature in the JSON object - possibly establish a well-known property name, like "nichols-hmac" or something, that gets the base64 encoded version of the hash. This property would have to be explicitly excluded by the hashing algorithm. Then, any receiver of the JSON would be able to check the hash.
The canonicalized representation does not need to be the representation you pass around in the application. It only needs to be easily produced given an arbitrary JSON object.
Instead of inventing your own JSON normalization/canonicalization, you may want to use bencode. Semantically it's the same as JSON (a composition of numbers, strings, lists, and dicts), but with the property of unambiguous encoding that is necessary for cryptographic hashing.
bencode is used as the torrent file format; every BitTorrent client contains an implementation.
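For illustration, a minimal bencode encoder is only a few lines of Ruby (a sketch; real clients use a gem, and note that bencode has no representation for floats, booleans, or null):
require "digest"

# bencode has exactly one encoding per value: integers are i<n>e,
# strings are <len>:<bytes>, lists are l...e, dicts are d...e sorted by key.
def bencode(value)
  case value
  when Integer then "i#{value}e"
  when String  then "#{value.bytesize}:#{value}"
  when Array   then "l#{value.map { |v| bencode(v) }.join}e"
  when Hash    then "d#{value.sort.map { |k, v| bencode(k.to_s) + bencode(v) }.join}e"
  else raise ArgumentError, "bencode cannot represent #{value.class}"
  end
end

# Every equivalent JSON document now hashes identically:
Digest::SHA256.hexdigest(bencode("Name1" => "Value1", "Name2" => "Value2"))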
This is the same issue as causes problems with S/MIME signatures and XML signatures. That is, there are multiple equivalent representations of the data to be signed.
For example in JSON:
{ "Name1": "Value1", "Name2": "Value2" }
vs.
{
    "Name1": "Value\u0031",
    "Name2": "Value\u0032"
}
Or depending on your application, this may even be equivalent:
{
    "Name1": "Value\u0031",
    "Name2": "Value\u0032",
    "Optional": null
}
Canonicalization could solve that problem, but it's a problem you don't need at all.
The easy solution if you have control over the specification is to wrap the object in some sort of container to protect it from being transformed into an "equivalent" but different representation.
I.e. avoid the problem by not signing the "logical" object but signing a particular serialized representation of it instead.
For example, JSON Objects -> UTF-8 Text -> Bytes. Sign the bytes as bytes, then transmit them as bytes e.g. by base64 encoding. Since you are signing the bytes, differences like whitespace are part of what is signed.
Instead of trying to do this:
{
    "JSONContent": { "Name1": "Value1", "Name2": "Value2" },
    "Signature": "asdflkajsdrliuejadceaageaetge="
}
Just do this:
{
    "Base64JSONContent": "eyAgIk5hbWUxIjogIlZhbHVlMSIsICJOYW1lMiI6ICJWYWx1ZTIiIH0s",
    "Signature": "asdflkajsdrliuejadceaageaetge="
}
I.e. don't sign the JSON, sign the bytes of the encoded JSON.
Yes, it means the signature is no longer transparent.
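A Ruby sketch of that wrap-and-sign flow, assuming an HMAC shared secret (the key and the payload values are placeholders):
require "json"
require "base64"
require "openssl"

secret  = "shared-secret"                       # placeholder key
payload = JSON.generate("Name1" => "Value1", "Name2" => "Value2")
encoded = Base64.strict_encode64(payload)
hmac    = Base64.strict_encode64(OpenSSL::HMAC.digest("SHA256", secret, encoded))

envelope = JSON.generate("Base64JSONContent" => encoded, "Signature" => hmac)
# The receiver verifies the HMAC over Base64JSONContent exactly as
# received, and only then decodes and parses the inner JSON.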
JSON-LD can do normalization.
You will have to define your context.
RFC 7638: JSON Web Key (JWK) Thumbprint includes a type of canonicalization. Although RFC 7638 expects a limited set of members, the same calculation can be applied to any set of members.
https://www.rfc-editor.org/rfc/rfc7638#section-3
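In Ruby the same calculation generalizes along these lines (a sketch assuming flat members with scalar values; nested objects would need recursive treatment):
require "json"
require "base64"
require "digest"

# RFC 7638 style: keep only the required members, sort them
# lexicographically, serialize with no whitespace, hash, base64url-encode.
def thumbprint(object, members)
  subset = members.sort.map { |name| [name, object.fetch(name)] }.to_h
  Base64.urlsafe_encode64(Digest::SHA256.digest(JSON.generate(subset)), padding: false)
end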
What would be ideal is if JavaScript itself defined a formal hashing process for JavaScript Objects.
Yet we do have RFC 8785, the JSON Canonicalization Scheme (JCS), which hopefully can be implemented in most JSON libraries and in particular added to the popular JavaScript JSON object. With that canonicalization done, it is just a matter of applying your preferred hashing algorithm.
If JCS is available in browsers and other tools and libraries, it becomes reasonable to expect most JSON on the wire to be in this common canonicalized form. Common, consistent application and verification of standards like this can go some way toward pushing back against trivial security threats from low-skilled actors.
I would do all fields in a given order (alphabetically, for example). Why does arbitrary data make a difference? You can just iterate over the properties (a la reflection).
Alternatively, I would look into converting the raw JSON string into some well-defined canonical form (removing all superfluous formatting) and hashing that.
We encountered a simple issue with hashing JSON-encoded payloads.
In our case we use the following methodology:
Convert the data into a JSON object;
encode the JSON payload in base64;
message-digest (HMAC) the generated base64 payload;
transmit the base64 payload.
Advantages of using this solution:
Base64 will produce the same output for a given payload.
Since the resulting signature is derived directly from the base64-encoded payload, and since the base64 payload is what is exchanged between the endpoints, we can be certain that the signature and payload are maintained.
This solution solves problems that arise due to differences in the encoding of special characters.
Disadvantages
The encoding/decoding of the payload may add overhead
Base64-encoded data is usually 30+% larger than the original payload.

What does it mean to convert an XML document to binary?

I am having a performance issue with XDocument.Load("large_file.xml"), where it takes about 25 seconds to load the file.
I read in this question that using a binary format could offer up to a 10x performance increase.
What does a binary format look like? How do you go about converting an XML file to it?
Let's start with the implied question:
Q: What is a Binary format?
A: It is a format in which data is represented in a non-textual form. For example, a Java int might be represented as 4 bytes, rather than a sequence of decimal digits and a sign.
Q: What does it look like?
A: If you view it with a text editor / viewer, it looks like garbage.
Q: How do you go about converting an XML file to a binary form?
A: By hand. Since a binary format is essentially a format (any format) that is not text, there is no magical method of converting it.
Q: How and why is a binary format faster?
A: A binary format isn't automatically faster to load than XML (or JSON). The idea is that you (the programmer) design a specific binary format for your application that will be faster to load. You typically do this by such things as:
avoiding the inclusion of verbose / repetitive structuring information (e.g. XML tag and attribute names),
using data encodings that require less CPU effort to turn into the in-memory representations,
avoiding the inclusion of unnecessary metadata,
avoiding things that require extra in-memory data copying,
and so on.
XML carries a lot of structural information, so it's big and slow to parse. You can create your own format.
For example:
<Data>Value</Data> can be reduced to just the value at a fixed offset in a binary file.
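In Ruby's pack notation, for illustration (the same idea as a C# BinaryWriter):
text   = "305419896"           # 9 bytes of digits that must be parsed
binary = [305419896].pack("N") # 4 bytes: "\x12\x34\x56\x78"
value  = binary.unpack1("N")   # read back with no text parsing at all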

how to check if a ruby string is an actual string or blob data such as an image

In Ruby, how can I check whether a string is an actual string or blob data such as an image? From the data-type point of view they are both Ruby strings, but their contents are very different: one is a literal string, the other is binary data such as an image.
Could anyone provide some clues? Thank you in advance.
Bytes are bytes. There is no way to declare that something isn't file data. It'd be fairly easy to construct a valid file in many formats consisting only of printable ASCII. Especially when dealing with Unicode, you're in very murky territory. If possible, I'd suggest modifying the method so that it takes two parameters... use one for passing text and the other for binary data.
One thing you might do is look at the length of the string. Most image formats are at least 500-600 bytes even for a tiny image, and while this is by no means an accurate test, if you get passed, say, a 20 KB string, it's probably an image. If it were text, it would be quite a lot of it (several thousand words, or thereabouts).
Files like images or sound files have defined blocks that can be "sniffed". Wotsit.org has a lot of info about the key bytes and ways to determine what the files are. By looking at those byte offsets in your data you could figure it out.
Another way is to use some "magic", which is code to sniff key bytes or byte types in a file to try to figure out its type. *nix systems have it built in via the file command. Do a man file or man magic for more info, or check Wikipedia's article on magic numbers in files.
Ruby Filemagic uses the same technique but is based on GNU's libmagic.
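Checking a few well-known signatures by hand is also straightforward; here is a Ruby sketch (the signature table is deliberately tiny, whereas libmagic knows thousands):
# Sniff a handful of well-known magic numbers at the start of the data.
MAGIC = {
  "\x89PNG\r\n\x1A\n".b => "png",
  "\xFF\xD8\xFF".b      => "jpeg",
  "GIF87a".b            => "gif",
  "GIF89a".b            => "gif",
  "%PDF-".b             => "pdf"
}.freeze

def sniff(data)
  head = data.byteslice(0, 16).b
  MAGIC.each { |magic, type| return type if head.start_with?(magic) }
  nil   # unknown: not necessarily text, just not in the table
end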
What would constitute a string? Are you expecting simple ASCII? UTF-8? Or text encoded some other way?
If you know you're going to get ASCII text or a blob, then you can just spin through the first n bytes and see if anything has the eighth bit set; that would tell you that you have binary data. OTOH, not finding anything wouldn't guarantee that you had text.
If you're going to get UTF-8 Unicode then you'd do the same thing but look for invalid UTF-8 sequences. Of course, the same caveats apply.
You could scan the first n bytes for anything between 0x00 and 0x20. If you find any bytes that low then you probably have a binary blob of some sort. But maybe not.
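Combining those heuristics into one Ruby sketch (with all the caveats above: it can misclassify in both directions, e.g. when the sample cuts a UTF-8 character in half):
def probably_binary?(str, sample_size = 512)
  chunk = str.byteslice(0, sample_size)
  # Control bytes other than tab/LF/CR are a strong hint of binary data.
  return true if chunk.bytes.any? { |b| b < 0x20 && ![9, 10, 13].include?(b) }
  # Otherwise, fall back to a UTF-8 validity check.
  !chunk.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end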
As Tyler Eaves said: bytes are bytes. You're starting with a bunch of bytes and trying to find an interpretation of them that makes sense.
Your best bet is to make the caller supply the expected interpretation or take Greg's advice and use a magic number library.

manually finding the size of a block of text (ASCII format)

Is there an easy way to manually (i.e. not through code) find the size (in bytes, KB, etc.) of a block of selected text? Currently I am taking the text, cutting/pasting it into a new text document, saving it, then clicking "properties" to get an estimate of the size.
I am developing mainly in Visual Studio 2008, but I need some simple way to do this manually.
Note: I understand this is not specifically a programming question, but it's related to programming. I need this to compare a few functions and see which one returns the smallest amount of text. I only need to do it a few times, so writing a method for it seemed like overkill.
This question isn't meaningful as asked. Text can be encoded in different formats: ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
I don't see the difference between using the code written by the app you're pasting into and using some other code. Being a Python person myself, whenever I want to check the length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
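The Ruby equivalent of that interactive check is a one-liner in irb; for ASCII, byte size and character count coincide:
"some block of text".bytesize   # => 18 bytes (same as .length for ASCII)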
I ended up just cutting/pasting the text into MS Word and using the character-count feature in there.
