Hash payloads like form data or json in ruby [duplicate] - ruby

The following question is more complex than it may first seem.
Assume that I've got an arbitrary JSON object, one that may contain any amount of data including other nested JSON objects. What I want is a cryptographic hash/digest of the JSON data, without regard to the actual JSON formatting itself (eg: ignoring newlines and spacing differences between the JSON tokens).
The last part is a requirement, as the JSON will be generated/read by a variety of (de)serializers on a number of different platforms. I know of at least one JSON library for Java that completely removes formatting when reading data during deserialization. As such it will break the hash.
The arbitrary data clause above also complicates things, as it prevents me from taking known fields in a given order and concatenating them prior to hashing (think roughly how Java's non-cryptographic hashCode() method works).
Lastly, hashing the entire JSON String as a chunk of bytes (prior to deserialization) is not desirable either, since there are fields in the JSON that should be ignored when computing the hash.
I'm not sure there is a good solution to this problem, but I welcome any approaches or thoughts =)

The problem is a common one when computing hashes for any data format where flexibility is allowed. To solve this, you need to canonicalize the representation.
For example, the OAuth1.0a protocol, which is used by Twitter and other services for authentication, requires a secure hash of the request message. To compute the hash, OAuth1.0a says you need to first alphabetize the fields, separate them by newlines, remove the field names (which are well known), and use blank lines for empty values. The signature or hash is computed on the result of that canonicalization.
XML DSIG works the same way - you need to canonicalize the XML before signing it. There is a W3C standard covering this (Canonical XML), because it's such a fundamental requirement for signing. Some people call it c14n.
I don't know of a canonicalization standard for json. It's worth researching.
If there isn't one, you can certainly establish a convention for your particular application usage. A reasonable start might be:
lexicographically sort the properties by name
double quotes used on all names
double quotes used on all string values
no space, or one-space, between names and the colon, and between the colon and the value
no spaces between values and the following comma
all other white space collapsed to either a single space or nothing - choose one
exclude any properties you don't want to sign (one example is, the property that holds the signature itself)
sign the result, with your chosen algorithm
You may also want to think about how to pass that signature in the JSON object - possibly establish a well-known property name, like "nichols-hmac" or something, that gets the base64 encoded version of the hash. This property would have to be explicitly excluded by the hashing algorithm. Then, any receiver of the JSON would be able to check the hash.
The canonicalized representation does not need to be the representation you pass around in the application. It only needs to be easily produced given an arbitrary JSON object.
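A minimal sketch of that convention in Ruby, assuming the standard json and openssl libraries. The property name "signature" and the helper names sort_keys, canonical_json and sign are made up for illustration, and only top-level properties are excluded:

```ruby
require 'json'
require 'openssl'

# Hypothetical property name that carries the signature and is never signed.
SIGNATURE_PROPERTY = 'signature'

# Rebuild a parsed JSON value with the keys of every object in sorted order.
def sort_keys(value)
  case value
  when Hash  then value.sort_by { |key, _| key }.to_h { |key, v| [key, sort_keys(v)] }
  when Array then value.map { |v| sort_keys(v) }
  else value
  end
end

# Parse, drop the excluded top-level properties, and re-serialize compactly.
def canonical_json(json_text, excluded: [SIGNATURE_PROPERTY])
  data = JSON.parse(json_text)
  data = data.reject { |key, _| excluded.include?(key) } if data.is_a?(Hash)
  JSON.generate(sort_keys(data))
end

# HMAC-SHA256 over the canonical form, hex-encoded.
def sign(json_text, secret)
  OpenSSL::HMAC.hexdigest('SHA256', secret, canonical_json(json_text))
end

a = '{ "Name1": "Value1", "Name2": "Value2" }'
b = %({\n  "Name2": "Value2",\n  "Name1": "Value1",\n  "signature": "ignored"\n})
puts sign(a, 'secret') == sign(b, 'secret')
```

Parsing and re-generating takes care of whitespace, quoting and string escapes. It does not normalize number spellings (1e2 and 100 still come out differently), so a real convention would need a rule for numbers as well.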

Instead of inventing your own JSON normalization/canonicalization you may want to use bencode. Semantically it's the same as JSON (composition of numbers, strings, lists and dicts), but with the property of unambiguous encoding that is necessary for cryptographic hashing.
bencode is used as the torrent file format, so every BitTorrent client contains an implementation.
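To give a feel for how small a bencode encoder is, here is a rough sketch in Ruby (the function name bencode is my own, and strings are assumed to be UTF-8). Note that bencode only has integers, byte strings, lists and dictionaries, so floats, booleans and null have no encoding and the sketch rejects them:

```ruby
require 'digest'

# Encode an Integer, String, Array or Hash as bencode.
# Dictionary keys are written in sorted (raw byte) order, which is what
# makes the encoding unambiguous.
def bencode(value)
  case value
  when Integer then "i#{value}e"
  when String  then "#{value.bytesize}:#{value}"
  when Array   then "l#{value.map { |v| bencode(v) }.join}e"
  when Hash
    pairs = value.map { |k, v| [k.to_s, v] }.sort_by { |k, _| k.b }
    "d#{pairs.map { |k, v| bencode(k) + bencode(v) }.join}e"
  else
    raise ArgumentError, "bencode has no encoding for #{value.class}"
  end
end

encoded = bencode({ 'b' => [1, 'xy'], 'a' => 42 })
puts encoded
puts Digest::SHA256.hexdigest(encoded)
```

Because there is exactly one valid encoding per value, hashing the encoded string gives the same digest on every platform.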

This is the same issue that causes problems with S/MIME signatures and XML signatures. That is, there are multiple equivalent representations of the data to be signed.
For example in JSON:
{ "Name1": "Value1", "Name2": "Value2" }
vs.
{
"Name1": "Value\u0031",
"Name2": "Value\u0032"
}
Or depending on your application, this may even be equivalent:
{
"Name1": "Value\u0031",
"Name2": "Value\u0032",
"Optional": null
}
Canonicalization could solve that problem, but it's a problem you don't need at all.
The easy solution if you have control over the specification is to wrap the object in some sort of container to protect it from being transformed into an "equivalent" but different representation.
I.e. avoid the problem by not signing the "logical" object but signing a particular serialized representation of it instead.
For example, JSON Objects -> UTF-8 Text -> Bytes. Sign the bytes as bytes, then transmit them as bytes e.g. by base64 encoding. Since you are signing the bytes, differences like whitespace are part of what is signed.
Instead of trying to do this:
{
"JSONContent": { "Name1": "Value1", "Name2": "Value2" },
"Signature": "asdflkajsdrliuejadceaageaetge="
}
Just do this:
{
"Base64JSONContent": "eyAgIk5hbWUxIjogIlZhbHVlMSIsICJOYW1lMiI6ICJWYWx1ZTIiIH0s",
"Signature": "asdflkajsdrliuejadceaageaetge="
}
I.e. don't sign the JSON, sign the bytes of the encoded JSON.
Yes, it means the signature is no longer transparent.
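A rough sketch of that wrapping in Ruby, assuming an HMAC-SHA256 with a shared secret stands in for the signature (the function names wrap and unwrap are hypothetical; the property names are the ones from the example above):

```ruby
require 'json'
require 'openssl'

# Serialize once, sign those exact bytes, and ship them base64-encoded.
def wrap(object, secret)
  bytes = JSON.generate(object)
  JSON.generate(
    'Base64JSONContent' => [bytes].pack('m0'),
    'Signature' => [OpenSSL::HMAC.digest('SHA256', secret, bytes)].pack('m0')
  )
end

# Verify the signature over the transmitted bytes, then parse them.
def unwrap(envelope_text, secret)
  envelope = JSON.parse(envelope_text)
  bytes = envelope.fetch('Base64JSONContent').unpack1('m')
  expected = OpenSSL::HMAC.digest('SHA256', secret, bytes)
  actual = envelope.fetch('Signature').unpack1('m')
  # Plain == keeps the sketch short; real code should compare in constant time.
  raise ArgumentError, 'signature mismatch' unless actual == expected
  JSON.parse(bytes.force_encoding(Encoding::UTF_8))
end

envelope = wrap({ 'Name1' => 'Value1', 'Name2' => 'Value2' }, 'secret')
puts envelope
puts unwrap(envelope, 'secret').inspect
```

Any intermediary can re-format the outer JSON as much as it likes; the inner bytes travel as an opaque string, so the signature still verifies.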

JSON-LD can do normalization.
You will have to define your context.

RFC 7638: JSON Web Key (JWK) Thumbprint includes a type of canonicalization. Although RFC 7638 expects a limited set of members, the same calculation could be applied to any members.
https://www.rfc-editor.org/rfc/rfc7638#section-3
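As a sketch of how that calculation could look in Ruby when applied to arbitrary members: keep only the chosen members, order them by name, serialize without whitespace, hash with SHA-256 and base64url-encode the digest. The function name thumbprint is made up, and the sketch assumes the selected values are plain strings (as in a JWK), so nested objects are not re-ordered:

```ruby
require 'json'
require 'digest'

# Thumbprint-style digest over a chosen set of members of a flat object.
def thumbprint(object, members)
  selected = members.sort.to_h { |name| [name, object.fetch(name)] }
  digest = Digest::SHA256.digest(JSON.generate(selected))
  # base64url without padding
  [digest].pack('m0').tr('+/', '-_').delete('=')
end

record = { 'kty' => 'oct', 'note' => 'ignored', 'k' => 'c2VjcmV0' }
puts thumbprint(record, %w[kty k])
```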

What would be ideal is if JavaScript itself defined a formal hashing process for JavaScript Objects.
Yet we do have RFC 8785, the JSON Canonicalization Scheme (JCS), which hopefully can be implemented in most JSON libraries and, in particular, added to the popular JavaScript JSON object. With this canonicalization done, it is just a matter of applying your preferred hashing algorithm.
If JCS is available in browsers and other tools and libraries, it becomes reasonable to expect most JSON on the wire to be in this common canonicalized form. Common, consistent application and verification of standards like this can go some way toward pushing back against trivial security threats by low-skilled actors.

I would do all fields in a given order (alphabetically for example). Why does arbitrary data make a difference? You can just iterate over the properties (ala reflection).
Alternatively, I would look into converting the raw json string into some well defined canonical form (remove all superfluous formatting) - and hashing that.

We encountered a simple issue with hashing JSON-encoded payloads.
In our case we use the following methodology:
Convert the data into a JSON object.
Encode the JSON payload in base64.
Compute a message digest (HMAC) of the generated base64 payload.
Transmit the base64 payload.
Advantages of using this solution:
Base64 will produce the same output for a given payload.
Since the resulting signature will be derived directly from the base64-encoded payload and since base64-payload will be exchanged between the endpoints, we will be certain that the signature and payload will be maintained.
This solution solves problems that arise due to differences in the encoding of special characters.
Disadvantages
The encoding/decoding of the payload may add overhead
Base64-encoded data is about 33% larger than the original payload (4 output characters for every 3 input bytes).

Related

What is the essential difference between Document and Collection in YAML syntax?

Warning: This question is more philosophical than practical, but I think it belongs in a practical forum (StackOverflow here, rather than the SoftwareEngineering Stack Exchange site) because of how YAML is actually used de facto and how its specification has evolved and gained features over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, that seems pretty official, or at least accurate and reliable source to understand the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marking ("---").
If so, it's hard to ignore the fact that the concept of a 'Document' in YAML is no longer an external definition, or a "meta"-directive that helps the human/parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file itself does not necessarily know that it is a file, or that it is part of a file system that wraps it, AFAIK).
However, when YAML allows multi-Document files that gather collections of Documents in a single YAML file (perhaps in a way that is analogous to the HTTP pipelining approach of the HTTP protocol), the concept of a Document de facto receives a new, wider definition as part of the YAML grammar and its productions, and not just as an assistive concept that helps to describe the specification.
If so, with Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar and well-used data structure of a Collection (array of items)?
I'm asking because I've seen a snippet at this link (here, in the second example) that describes a YAML sequence that is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated by three dashes), gathered together in the same YAML sequence/file, instead of writing a file that has a "Collection" of logs represented as an array. Why did he choose to do this? Is his choice fitting, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as Anchors, Tags and References. I guess every Document provides a guarantee that all these identifiers form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two rather similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or to see the two as different layers of collection schemes. Is that a correct, accurate view?
And even if I am right, why does the author of the above example (the logs document from the link) still choose to represent the collection's items as separate documents, when no duplicates, collisions, identifiers/anchors or compound structures are used or expected at all? Is this just a poorly chosen example? Or maybe I'm missing something, and this is a redundancy in the specification, or syntactic sugar that evolved from practical needs?
Because the example was written on a website that looks serious, with official information written by professionals who dealt with the essence of the language and its definition, theory and philosophy (as opposed to practical uses in the wild), and also in light of the other meticulous examples I have seen there, I prefer not to assume that the example is simply imperfect, and to assume instead that there may be a good reason to write it this way in this specific case.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars, which get ! instead. These non-specific tags are later resolved to specific tags, thereby determining which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
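Ruby's bundled YAML library (Psych) resolves untagged scalars by value in the same way, so it makes for a quick illustration. The key names here are made up; note how the quoted scalar stays a string:

```ruby
require 'yaml'

doc = YAML.load(<<~YAML)
  a: true
  b: 42
  c: droggeljug
  d: "42"
YAML

# Print each value together with the Ruby class it was resolved to.
doc.each { |key, value| puts "#{key}: #{value.inspect} (#{value.class})" }
```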
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
String name;
int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that the path to the node, not the content of the node, defines how it will be deserialized.
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
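As a small illustration of merging, here is a sketch in Ruby (using the bundled Psych library) that concatenates two independently produced documents into one stream without looking at their contents and then reads them back one by one. The document bodies are made up:

```ruby
require 'yaml'

first  = "name: first\n"
second = "name: second\ncount: 2\n"

# A "middleware" merge: prefix each document with the --- marker and
# concatenate, without interpreting the documents themselves.
stream = [first, second].map { |doc| "---\n#{doc}" }.join

# Each document in the stream is loaded on its own.
docs = YAML.load_stream(stream)
docs.each_with_index { |doc, index| puts "document #{index}: #{doc.inspect}" }
```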
This design primarily focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as a configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about edge cases, like a string-tagged Document that starts with a folded string, making even the following "---" and "..." just characters of that string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

PHP's pack/unpack in Go

I'm intending to rewrite a game server I already have working in PHP-cli.
As I cannot touch the client, I have to use the on-wire protocol as-is. This is a pure binary format, with multiple fields per packet. I use a modified version of PHP's pack/unpack commands to convert to and from this. A packet is generally in the form:
header:
unpack('nitem/ncmdNum/NdataLen', $buf);
data:
any of several dozen subsequent unpack strings, as identified by cmdNum. e.g.:
a32first_name/a32second_name/a32third_name/nfirst_itemid/nfirst_flags/nfirst_level/nunused/Nfirst_perm_flags/Ncvsize/a{$cvsize}contentsVector
nsuccess/ndummyfielda/a*xerror_msg/a*xtext/NcurrentVersion/NtimeLimit/a*xlimitValue/a*xlimit
[Where a{$cvsize} means a fixed-length string whose length is given by the named (usually immediately previous) value, and a* means a zero-terminated variable-length string.]
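For readers who don't know the PHP format codes: n is an unsigned 16-bit big-endian integer and N is an unsigned 32-bit big-endian integer. Ruby's pack/unpack happens to use the same two letters, so the 8-byte header layout can be sketched as below. The sample values and variable names are made up, and the payload is assumed to follow the header directly:

```ruby
# Build a sample packet: item=1, cmdNum=258, dataLen=5, then 5 payload bytes.
buf = [1, 258, 5].pack('nnN') + 'hello'

# Header: two 16-bit and one 32-bit big-endian fields (8 bytes in total).
item, cmd_num, data_len = buf.unpack('nnN')

# The payload follows the header; its length comes from the header itself.
data = buf.byteslice(8, data_len)

puts "item=#{item} cmdNum=#{cmd_num} dataLen=#{data_len} data=#{data.inspect}"
```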
The current PHP implementation unpacks this and calls a function that deals with command 'cmdNum', passing an associative array containing the unpacked data. This in turn calls the sending code with a similar array for the return values.
Whilst I'm sure I could map these to structures, reading from the input (and writing to it) wouldn't simply be a matter of dropping the buffer over the struct. Plus, most packet types are only used once, in a function dedicated to dealing with that message, so coding up several dozen structures, and the code to deal with loading each field individually, seems like a lot of work.
Is there any method or package that I can use as the basis for dealing with this sort of thing? My searching for "php unpack in go" only seems to return results based on people unpacking a single numerical value, which is obviously easy enough to replace with encoding/binary!
The unpack/pack strings are auto-generated by some other PHP based on a specification grabbed from the client, so I could change that to create a different format fairly easily, if there is something I can use. I'd normally have no issues with writing my own functions to do this sort of thing, but being totally new to Go, this might be too much off-the-bat.

How secure is protobuf against someone getting some of the data out?

Without any encryption, if the recipient has the serialized Protobuf file but does not have the generated Protobuf class (they don't have access to the .proto file that define its structure), is it possible for them to get any data in the Protobuf file from the binary?
If they have access to a part of the .proto file (for example, just one related message in the file) can they get a part of that data out from the entire file while skipping other unknown parts?
yes, absolutely; the protoc tool can help with this (see: --decode_raw), as can https://protogen.marcgravell.com/decode - so it should not be treated as "secure" at all
yes, absolutely - that's a key part built into the protocol that allows messages to be extensible such that they can decode the bits they understand and either ignore or just store (for round-trip or "extension" fields) the bits they don't understand
protobuf is not a security device; to someone with the right tools it is just as readable as xml or json, with the slight issue that it can be uncertain how to interpret some values; but: you can infer and guess and reverse engineer
Ok, I have found this page https://developers.google.com/protocol-buffers/docs/encoding
The message discards all the names and is just a series of pairs of field numbers and values. The generated class might offer some protection by safely reading this data and not reading unknown data (naturally, because the generated class was generated from a known structure, the .proto file).
But if I am an attacker I could reference that Encoding page and try to figure out which area in the binary corresponds to which data. For example, varint might be easy to spot after changing some data. And proceed to write my own .proto file to attack this unknown data or even a custom binary reader that can selectively read part of the binary.
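To show how little is needed for such a custom binary reader, here is a rough sketch in Ruby based only on the wire format described on that encoding page. The names read_varint and decode_raw are my own, the deprecated group wire types (3 and 4) are not handled, and nested messages are simply returned as raw bytes:

```ruby
# Read one base-128 varint starting at pos; returns [value, next_pos].
def read_varint(bytes, pos)
  value = 0
  shift = 0
  loop do
    byte = bytes.getbyte(pos)
    raise ArgumentError, 'truncated varint' if byte.nil?
    pos += 1
    value |= (byte & 0x7f) << shift
    return [value, pos] if byte < 0x80
    shift += 7
  end
end

# Walk a serialized message without any .proto file and list
# [field_number, wire_type, raw_value] for every field found.
def decode_raw(bytes)
  fields = []
  pos = 0
  while pos < bytes.bytesize
    key, pos = read_varint(bytes, pos)
    number = key >> 3
    wire_type = key & 7
    case wire_type
    when 0 # varint
      value, pos = read_varint(bytes, pos)
    when 1 # fixed 64-bit
      value = bytes.byteslice(pos, 8)
      pos += 8
    when 2 # length-delimited: strings, bytes, nested messages
      length, pos = read_varint(bytes, pos)
      value = bytes.byteslice(pos, length)
      pos += length
    when 5 # fixed 32-bit
      value = bytes.byteslice(pos, 4)
      pos += 4
    else
      raise ArgumentError, "unsupported wire type #{wire_type}"
    end
    fields << [number, wire_type, value]
  end
  fields
end

# Field 1 = varint 150, field 2 = the string "testing".
sample = [0x08, 0x96, 0x01, 0x12, 0x07].pack('C*') + 'testing'
decode_raw(sample).each { |field| puts field.inspect }
```

What the reader cannot tell is what the fields mean (is a varint an int32, a bool or an enum?), which is where the guessing and reverse engineering come in.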

Ruby equivalent of ReadString?

I'm working on a project with a "customer made" database. He developed a C++/CLI application that stores and retrieves his data from a binary file using the BinaryWriter.Write(String) and BinaryReader.ReadString() methods.
I'm no C++/CLI expert, but from what I understand these methods use a 7-bit encoding in the first bytes to determine the String length.
I need to access his data from a Rails application. Does anyone have an idea of how to do the same thing in Ruby?
If you're dealing with raw binary data, you'll probably need to spend some time familiarizing yourself with the pack and unpack methods and their various options. Maybe what you're describing is a "Pascal string" where the length is encoded up front, or a variation on that.
For example:
# Read the one-byte length prefix, then the string that follows it.
length = data.unpack("C")[0]
string = data.unpack("Ca#{length}")[1]
The double-unpack is required because you don't know the length of the string to unpack until you do the first step. You could probably do this using a substring as well, like data[1,length] if you're reasonably certain you're not dealing with UTF-8 data.
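For strings of 128 bytes or more, a single length byte is not enough: the 7-bit encoded prefix that BinaryWriter.Write(String) produces stores 7 bits of the length per byte, low bits first, with the high bit set on every byte except the last. A minimal sketch of a reader, assuming the writer used its default UTF-8 encoding (read_dotnet_string is a made-up name):

```ruby
require 'stringio'

# Read one length-prefixed string from an IO positioned at its prefix.
def read_dotnet_string(io)
  length = 0
  shift = 0
  loop do
    byte = io.readbyte # raises EOFError if the prefix is truncated
    length |= (byte & 0x7f) << shift
    break if byte < 0x80
    shift += 7
  end
  bytes = io.read(length)
  raise EOFError, 'truncated string' if bytes.nil? || bytes.bytesize < length
  bytes.force_encoding(Encoding::UTF_8)
end

# Two strings back to back: "hello" (prefix 0x05) and 200 x "a" (prefix 0xC8 0x01).
data = [5].pack('C') + 'hello' + [0xC8, 0x01].pack('C*') + 'a' * 200
io = StringIO.new(data)
puts read_dotnet_string(io)
puts read_dotnet_string(io).length
```

With a File opened in binary mode instead of the StringIO, the same function can be called repeatedly to walk through the records.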

I need a name for a particular data structure

I keep running into a certain kind of data structure, and wonder if there is a name for it. It maps very closely to JSON, but not exactly. The rules are:
It is composed entirely of maps, arrays, and primitives.
It is hierarchical. Maps contain name/value pairs, where a value can be another map, an array, or a primitive. Arrays contain values with the same rules.
The top level is always a map.
The primitives are strings, integers, floats, booleans, and possibly dates.
Sometimes the map is just an unordered hash, and sometimes the order of the name/value pairs matters.
This is a really, really useful structure. You can use it to represent documents, database records, various messages, http requests, lots of stuff. I've run into it in Freemarker (as the 'data model'), Mongo, and anything that uses JSON.
It's not really JSON, because that's a file format, not a specification for a particular data structure. It's not an "object", because object trees can point to other things, like streams and functions. It's not a DOM.
What is it?
Around the office, we've started to call it a "garg", for "generalized argument".
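To pin the rules down, here is how they might be written as a check in Ruby. This is only a sketch of my reading of the rules: names are assumed to be strings, dates are taken to mean Ruby's Date and Time, and the function names are made up:

```ruby
require 'date'

# The primitive types allowed at the leaves.
GARG_PRIMITIVES = [String, Integer, Float, TrueClass, FalseClass, Date, Time].freeze

# A value is a map, an array, or a primitive, recursively.
def garg_value?(value)
  case value
  when Hash  then value.all? { |name, v| name.is_a?(String) && garg_value?(v) }
  when Array then value.all? { |v| garg_value?(v) }
  else GARG_PRIMITIVES.any? { |type| value.is_a?(type) }
  end
end

# The top level is always a map.
def garg?(value)
  value.is_a?(Hash) && garg_value?(value)
end

puts garg?({ 'name' => 'x', 'tags' => ['a', 1, 2.5, true], 'nested' => { 'ok' => false } })
puts garg?({ 'stream' => $stdout })
```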
It's not really JSON, because that's a file format, not a specification for a particular data structure.
It might not be JSON (since the specs include syntax rules), but your structure definition defines the same data structure as JSON does.
I don't think it's useful to name this structure. When you are talking about data, just call it data. When you need to interchange data you need a data-interchange format. Now JSON proves to be one damn good one.
JSON isn't just a file format. JSON is also a data structure.
From JSON.org
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
These are universal data structures.
It is a generic data storage structure that carries around hierarchical data. I don't have a generic name for it, but if I were to implement such a beast in, say, C++, I'd probably call the abstract base class a Variant, and name the concrete types by their names: Integer, Array, Map, etc. I'd chuck them in a namespace that would relate to where I'd use them - or maybe I'd prefix the types themselves. I've seen such structures used as well, but I don't know if there is a name that I'd recognize. A DataStore, Environment, StorageBin, or anything that is generic and implies storage of data would do.
I don't see myself calling such a class hierarchy JSON, though. I would provide a JsonSerializer or some such to map this data to JSON, if I needed it.
It sounds like you're describing an associative array, with optional ordering.
That's what JSON represents, except that (I believe) JSON doesn't impose an ordering requirement. Naturally, many other representations also describe associative arrays, which is why JSON is a popular text serialization.
Update 1: JSON isn't properly an associative array. It is a description of object properties. Because it is very often construed as an associative array, many people make the same mistake I did. In fact, "object notation" is the proper name for it - surprise, surprise. :) In addition, JSON isn't a file format - it's a text serialization or markup language, which is different from a file format.
The structure is a tree with different kinds of values stored at its leaves.
In Boost, a similar structure is called Property Tree.
