What is the essential difference between Document and Collection in YAML syntax? - yaml

Warning: This question is more philosophical than practical, but I think it is worth asking and answering in a practical context (a forum like StackOverflow rather than the Software Engineering Stack Exchange site), because of how YAML is actually used and how its specification has evolved and gained features over time. Let's ask:
Unlike formats such as JSON, YAML allows you (according to this link, which seems fairly official, or at least an accurate and reliable source for understanding the YAML specification) to embed multiple 'Documents' within one file/stream, separated by the three-dash marker ("---").
If so, it's hard to ignore the fact that the concept of a 'Document' in YAML is no longer an external definition or "meta" directive that helps the human or the parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file does not itself necessarily know that it is a file, or that it is part of a file system that wraps it, AFAIK).
However, when YAML allows multi-Document files that gather collections of Documents in a single YAML file (perhaps in a way analogous to HTTP pipelining), the concept of a Document receives a new, wider de facto definition as part of the YAML grammar and its productions, and not just as an assistive concept that helps describe the specification.
If so, with Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar and well-used data structure of a Collection (an array of items)?
I'm asking because I've seen in this link (here) a snippet (the second example) that describes a YAML stream that is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated with three dashes) gathered in the same YAML stream/file, instead of writing a file that holds a "Collection" of logs represented as an array. Why did he choose to do this? Is his choice fitting, correct, ideal?
I can speculate that the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as Anchors, Tags and References. I guess every Document provides a guarantee that all these identifiers form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two rather similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or to see the two as different layers of collection schemes. Is this a correct, accurate view?
And even if I am right, why does the author of the above example (the logs from the link) still choose to represent the collection's items as separate documents, when the example does not use, imply or expect any duplicates, collisions, identifiers/anchors or compound structures at all? Is this just a poorly chosen example? Or am I missing something, and this is a redundancy in the specification, or syntactic sugar that evolved out of practical needs?
Because the example appears on a website that looks serious, with official information written by professionals who dealt with the essence of the language, its definition and the theory and philosophy behind it (as opposed to practical uses in the wild), and in light of the other meticulous examples I have seen there, I prefer not to assume that the example is simply imperfect or ill-fitting; there may be a good reason to write it this way rather than another in this specific case.

First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars, which get ! instead. These non-specific tags are later resolved to specific tags, thereby defining which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class:
public class Config {
    String name;
    int count;
}
Assume the YAML is:
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that it is not the content of the node that defines how it will be deserialized, but the path to the node.
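To make the contrast concrete, here is a minimal Python sketch of the two resolution strategies. It is not taken from PyYAML, SnakeYAML or any other library; resolve_by_content, resolve_by_path and SCHEMA are hypothetical names, and the scalars are assumed to arrive as plain strings.

```python
def resolve_by_content(text):
    """Guess a type from the scalar's text alone (scripting-language style)."""
    if text in ("true", "false"):
        return text == "true"
    try:
        return int(text)
    except ValueError:
        return text


# Hypothetical stand-in for the Java class above: field name -> expected type.
SCHEMA = {"name": str, "count": int}


def resolve_by_path(schema, raw):
    """Convert each scalar to the type expected at its position (static-type style)."""
    return {key: schema[key](text) for key, text in raw.items()}


print(resolve_by_content("42"))           # an int, because it looks like one
print(resolve_by_content("droggeljug"))   # stays a string
print(resolve_by_path(SCHEMA, {"name": "42", "count": "5"}))  # name stays a string

try:
    resolve_by_path(SCHEMA, {"name": "42", "count": "five"})
except ValueError as error:
    # "five" sits at a position that requires an int, so loading fails.
    print("error:", error)
```

In the first function only the text decides the type; in the second, the same text "42" ends up as a string because of where it sits.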
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node, and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primarily focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as a configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about edge cases like a string-tagged Document that starts with a folded string, making even the following "---" and "..." just characters of that one string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
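A stream merger of the kind described here can be sketched in a few lines of Python. This is only an illustration under assumptions: merge_streams is a hypothetical helper, it treats each stream as opaque text, and it does not check whether the receiving parser accepts a bare document (one without ---) directly after ..., which may differ between YAML versions and implementations.

```python
def merge_streams(*streams):
    """Concatenate YAML streams as text, closing each one with a '...' line."""
    parts = []
    for stream in streams:
        if not stream.endswith("\n"):
            stream += "\n"          # make sure '...' starts on its own line
        parts.append(stream + "...\n")
    return "".join(parts)


merged = merge_streams("a: 1", "---\nb: 2\n")
print(merged)
```

The merger never inspects the content of the documents; it relies only on the rule that an un-indented ... closes whatever document is open.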
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

Related

Do (document) bundle entries always have to be referenced or referencing?

The specification for FHIR documents seems to mandate that all bundle entries in the document resource be part of the reference graph rooted at the Composition entry. That is, they should be the source or the target of a reference relation that traces all the way up to the root entry.
Unfortunately I have not been able to locate all the relevant passages in the FHIR specification; one place where it is spelled out is in 3.3.1 Document Content, but it is not really clear whether this pertains to all bundles of type 'document' (i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document).
The problem with the referencedness requirement lies in the fact that the HAPI validator employs linear search for checking the references. So, if we have to ship N bundle entries full of data to a payor, we have to include a list with N references (one for each data-bearing bundle entry). That leads to N reference searches with O(N) effort during validation, which makes the reference checking complexity effectively quadratic in the number of entries.
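The complexity argument can be illustrated with a small Python sketch. This is not HAPI code; the function names are hypothetical, entries are modelled as plain dicts with a fullUrl key, and references are plain strings. It only contrasts a linear scan per reference with a one-time index.

```python
def check_references_linear(entries, references):
    """Scan all entries for every reference: O(N) per lookup, O(N^2) overall."""
    missing = []
    for ref in references:
        if not any(entry["fullUrl"] == ref for entry in entries):
            missing.append(ref)
    return missing


def check_references_indexed(entries, references):
    """Build a set of known URLs once, then do constant-time lookups."""
    known = {entry["fullUrl"] for entry in entries}
    return [ref for ref in references if ref not in known]


entries = [{"fullUrl": f"urn:uuid:{i}"} for i in range(1000)]
references = [entry["fullUrl"] for entry in entries] + ["urn:uuid:missing"]

# Both strategies find the same dangling reference; only the cost differs.
print(check_references_linear(entries, references))
print(check_references_indexed(entries, references))
```

With N entries and N references the first variant performs on the order of N * N comparisons, the second on the order of N.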
This easily brings even the most powerful computers to their knees. Current size constraints effectively cap the number of entries per file at roughly 25000, and the HAPI validator needs several hours to chew through that, even on the most powerful CPUs currently available. Without the references, validation would take less than a minute for the same file.
In our use case, data-bearing entries have no identity outside of the containing bundle file. Practically speaking they would need neither entry.fullUrl nor entry.resource.id, because their business identifiers are contained in included base64 blobs. However, presence or absence of these identifiers has no practical influence on the time needed for validation (fractions of a second even for a 1 GB file), so who cares. It's the list of references that kills the HAPI validator.
Perhaps it would be possible to fulfil the letter of the referencedness requirement by making all entries include a reference to the Composition. The HAPI validator doesn't care either way, so I don't know whether that would be valid or not. But even if it were FHIRly valid, it would be a monstrously silly workaround.
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection', or by using contained resources?
P.S.: for the moment we are using a workaround that cuts the time for validation from hours to less than a minute, but it's a hack, and we currently don't have the resources to fix the HAPI validator. What I'm mostly concerned about is the question how the specifications (profiles) need to be changed in order to avoid the problem I described.
(i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document)
If it is not a document, and not intended to be one, do not use the 'document' Bundle type. If you do, you would be misrepresenting the data, which is what FHIR tries to avoid.
It seems like you want to send a collection of resources that are not necessarily related, so
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection'
Yes, I would use 'collection', or maybe a 'batch/transaction' depending on what I want to tell the receiver to do with the data.
The documents page says:
The document bundle SHALL include only:
The Composition resource, and any resources directly or indirectly (e.g. recursively) referenced from it
A Binary resource containing a stylesheet (as described below)
Provenance Resources that have a target of Composition or another resource included in the document
A document is intended as an attested, human-readable, frozen set of content. If that's not what you need, then use a different Bundle type. However, if you do need the 'document' type, that doesn't mean that systems should necessarily validate all requirements at runtime.

Why does protobuf's FieldMask use field names instead of field numbers?

In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?
You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."
The answer to your question lies in the fact that FieldMasks are a convention/utility developed on top of the proto3 schema definition language, and not a feature of it (and that utility is not present in all of the language bindings).
While you're right in your observation that it can break easily (as schemas tend to evolve and change), you need to consider this design choice from a user-friendliness POV:
If you're building an API and want to allow the user to select the field set present inside the response payload (the common use case for field masks), it'll be much more convenient to allow that using field paths rather than binary field indices, as the latter would force the user of the gRPC/protobuf-generated code to be "aware" of the schema. That's not always desirable when providing an API as a software package.
While implementing this as a proto schema feature can allow the user to have the best of both worlds (specify field paths, have them encoded as binary indices) for binary encoding, it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.
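The trade-off can be sketched in Python with a toy schema. This is not the protobuf API: SCHEMA, its message names (Root, Foo, Bar) and the field numbers are made up for illustration, chosen so that foo.bar.buzz corresponds to the 1.3.1 from the question.

```python
# Hypothetical schema: message name -> {field name: (field number, nested message)}
SCHEMA = {
    "Root": {"foo": (1, "Foo")},
    "Foo": {"bar": (3, "Bar")},
    "Bar": {"buzz": (1, None)},
}


def path_to_numbers(schema, root, path):
    """Translate a dotted field-name path into a dotted field-number path."""
    numbers = []
    message = root
    for name in path.split("."):
        number, message = schema[message][name]  # KeyError if the name is unknown
        numbers.append(str(number))
    return ".".join(numbers)


print(path_to_numbers(SCHEMA, "Root", "foo.bar.buzz"))

# Renaming bar to baz keeps the number path stable but breaks the name path.
RENAMED = dict(SCHEMA, Foo={"baz": (3, "Bar")})
print(path_to_numbers(RENAMED, "Root", "foo.baz.buzz"))
```

The number path survives the rename, but a caller can only produce or read it with the schema at hand, which is the usability cost described above.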

YAML : Use mapped list vs array

I am creating a configuration file for my application. To do it, I decided to use YAML for its simplicity and reliability.
I am currently designing a special part of my application: In this part, I have to list and configure all datasets I want to use in a module. To do that I wrote this :
# Other stuff
datasets:
  rate_variation:
    name: Rate variation over time # Optional
    description: Description here # Optional
    type: POINTS_2D
    options:
      REFRESH_TIME: 5 # Refresh time in seconds
  frequency_variation:
    name: Frequency variation over time
    description: Description here # Optional
    type: POINTS_2D
But, after some reflection, I have some doubts about it. Because maybe something like this is better :
datasets:
  - id: rate_variation
    name: Rate variation over time # Optional
    description: Description here # Optional
    type: POINTS_2D
    options:
      REFRESH_TIME: 5 # Time of refresh in second
  - id: frequency_variation
    name: Frequency variation over time
    description: Description here # Optional
    type: POINTS_2D
I use the ID to identify each dataset in my scripts (two datasets must have a different id) and generate output files for each of them.
But now, I really don't know what is the best solution...
What would you recommend to use? And for what reason?
Quick Answer (TL;DR)
YAML can be normalized quite cleanly and in a straightforward manner using the YAML ddconfig format.
Using this approach can simplify construction and maintenance of configuration files, and make them highly flexible for later use by many types of consuming applications.
Detailed Answer
Context
Data normalization (aka YAML schema definition) with YAML ddconfig format
(tag:dreftymac@dreftymac.org,2017:ddconfig)
dmid://uu773yamldata1620421509
Problem
Scenario: Developer graille_stentiplub is creating a configuration file format for use with YAML.
the data structure (i.e., schema) for the YAML must be flexible for use in many contexts.
the schema should be amenable to arbitrary and flexible queries where the structure of the YAML does not "get in the way".
the schema should be easy to read and understand by humans.
the schema should be easily manipulated by any programming environment capable of processing standard YAML.
Special considerations: graille_stentiplub wants an easy way to determine when to use lists vs mappings.
Example
the following is a simple config file using YAML ddconfig format
dataroot:
  file_metadata_str: |
    ### <beg-block>
    ### - caption: "my first project"
    ###   notes: |
    ###     * href="//home/sm/docs/workup/my_first_project.txt"
    ### <end-block>
  project_info:
    prj_name_nice: StackOverflow Demo Answer Project
    prj_name_mach: stackoverflow_demo_001a
    prj_sponsor_url: https://stackoverflow.com/questions/54349286
    prj_dept_url: https://demo-university.edu/dept/basketweaving
  dataset_recipient_list:
    - graille_stentiplub@example.org
    - dreftymac_lufcrom@demo-university.edu
    - nobody_knows_who_you_are@example.com
  dataset_variations_table:
    - dvar_id: rate_variation
      dvar_name: Rate variation over time # Optional
      dvar_description: Description here # Optional
      dvar_type: POINTS_2D
      dvar_opt_refresh_per_second: 5 # Time in seconds
    - dvar_id: frequency_variation
      dvar_name: Frequency variation over time
      dvar_description: Description here # Optional
      dvar_type: POINTS_2D
Explanation
The entire data structure is nested under a toplevel key called dataroot (this is optional).
Inclusion of the dataroot key makes the YAML structure more addressable but is not necessary.
Using a filesystem analogy, you can think of dataroot as a root-level directory.
Using an XML analogy, you can think of this as the root-level XML tag.
The entire data structure consists of a YAML mapping (aka dictionary) (aka associative-array).
every mapping key is a first-level child of dataroot (or else a toplevel key if dataroot is omitted).
There are different types of mapping keys:
String: (suffix _str) indicates that the mapped value is a string (aka scalar) value.
List: (suffix _list) indicates the mapped value is a list (aka sequence).
Info: (suffix _info) indicates the mapped value is mapping (aka dictionary) (aka associative-array).
Table: (suffix _table) indicates the mapped value is a sequence-of-mappings (aka table).
Tree: (suffix _tree) indicates a composite structure with support for one or more nested parent-child relationships.
Rationale
The YAML ddconfig format coincides nicely with many different contexts and tools.
This allows for simplified decision making when laying out the configuration file format, as well as simplified programming when parsing the file.
Simplicity
a _list mapping consists of a sequence of scalar-value items with no nesting.
a _info mapping consists of a scalar-key and a scalar-value (name-value pairs) with no nesting.
a _table mapping is simply a sequence of _info mappings.
nesting of arbitrary depth can be accomplished through YAML anchors and aliases, thus supporting the _tree composite data structure.
Similarity to relational databases
You can think of a ddconfig _info mapping as a single record from a standard table in a relational database.
You can think of a ddconfig _table mapping as a standard table in a relational database.
This similarity makes it extremely straightforward to transmit YAML to a database if and where necessary.
Anchors and aliases
The YAML ddconfig format works well with YAML anchors and aliases.
One or more _info mappings can be easily converted to a _table mapping by way of aliases.
Multiple _info mappings can be combined together into another _info mapping by way of YAML merge keys.
See also
github link https://github.com/dreftymac/trypublic/search?q=uu773yamldata1620421509
With the first option, YAML enforces that there are no duplicate IDs. Therefore, an editor supporting YAML may support your user by showing an error in this case. With the second option, you need to check uniqueness in your code and the user only sees the error when loading the syntactically correct YAML into your application.
However, there are other factors to consider. For example, you may have a preference for the resulting in-memory data structures. If you use standard YAML implementations that deserialize to native data structures (PyYAML, SnakeYAML etc), the YAML structure imposes the type of the in-memory data structure (you can customize by writing custom constructors, but that's not trivial). For example, if you want to ask a dataset object for its ID, that is only directly doable with the second structure – if you use the first structure, you would need to search the parent table for the dataset value you have to get its ID.
So, final answer is (as always): It depends. Think about what you want to do with it. For simple configuration files, my second argument may be weaker than my first one, but I don't know what exactly you want to do with the data.
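Whichever form the file uses, the application can convert between the two after loading. The following Python sketch assumes the YAML has already been deserialized into native dicts and lists; list_to_mapping and mapping_to_list are hypothetical helper names. It also shows the uniqueness check that the list form leaves to your code.

```python
def list_to_mapping(datasets):
    """Convert the list form to the mapping form, rejecting duplicate ids."""
    by_id = {}
    for dataset in datasets:
        dataset_id = dataset["id"]
        if dataset_id in by_id:
            raise ValueError(f"duplicate dataset id: {dataset_id}")
        by_id[dataset_id] = {k: v for k, v in dataset.items() if k != "id"}
    return by_id


def mapping_to_list(datasets):
    """Convert the mapping form to the list form, so each item knows its id."""
    return [{"id": dataset_id, **body} for dataset_id, body in datasets.items()]


as_list = [
    {"id": "rate_variation", "type": "POINTS_2D"},
    {"id": "frequency_variation", "type": "POINTS_2D"},
]
as_mapping = list_to_mapping(as_list)
print(as_mapping["rate_variation"])
print(mapping_to_list(as_mapping) == as_list)
```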

In YAML, must a quoted scalar be interpreted by a parser as a string?

I've seen advice around the Internet that if you want a YAML scalar value to be processed as a string, you should quote it:
foo : "2018-04-17"
In the example above, this advice is intended to tell me that the value 2018-04-17 will be processed by any given YAML parser as its native language's string type. For example, SnakeYAML would, if this advice were true, interpret this as a java.lang.String, and not as a java.util.Date. (As it happens, SnakeYAML interprets this as a java.util.Date, quotes or not, which is why I'm asking this question.)
But although this advice may happen to work with any given parser, I can't see where in the YAML 1.2 specification this advice might come from. The closest thing I can find is the following sentence:
YAML allows scalars to be presented in several formats. For example, the integer “11” might also be written as “0xB”. Tags must specify a mechanism for converting the formatted content to a canonical form for use in equality testing. Like node style, the format is a presentation detail and is not reflected in the serialization tree and representation graph.
And this one:
The scalar style is a presentation detail and must not be used to convey content information, with the exception that plain scalars are distinguished for the purpose of tag resolution.
And this one:
Note that resolution must not consider presentation details such as comments, indentation and node style.
Nevertheless, I see lots of YAML documents that rely on the double-quoting-the-value-means-it-will-be-parsed-as-a-string advice, which makes me think I'm misreading something. Is there contention on this subject?
Relevant section from the YAML 1.1 spec (note that SnakeYaml is YAML 1.1 and therefore, the 1.2 spec does not necessarily apply):
It is not required that all the tags of the complete representation be explicitly specified in the character stream. During parsing, nodes that omit the tag are given a non-specific tag: “?” for plain scalars and “!” for all other nodes. [...]
It is recommended that nodes having the “!” non-specific tag should be resolved as “tag:yaml.org,2002:seq”, “tag:yaml.org,2002:map” or “tag:yaml.org,2002:str” depending on the node’s kind. This convention allows the author of a YAML character stream to exert some measure of control over the tag resolution process. By explicitly specifying a plain scalar has the “!” non-specific tag, the node is resolved as a string, as if it was quoted or written in a block style. Note, however, that each application may override this behavior. For example, an application may automatically detect the type of programming language used in source code presented as a non-plain scalar and resolve it accordingly.
So to sum up, a YAML processor is not required to parse quoted scalars as string, and YAML also does not dictate which native type tag:yaml.org,2002:str does map to. And in fact, most YAML implementations do only follow parts of that advice. For example, if you deserialise YAML into a POJO/JavaBean with SnakeYaml, you typically do not use any explicit tags in your YAML, but your mappings are resolved to the corresponding Java classes in the root class' structure, instead of the generic Map which is what this advice suggests (since all mappings without explicit tags get the ! non-specific tag).
Note that this has been changed in YAML 1.2:
During parsing, nodes lacking an explicit tag are given a non-specific tag: “!” for non-plain scalars, and “?” for all other nodes.
That's closer to most implementations, but for example, if you deserialise into a class class Foo { String bar; }, this will still load although bar is not a string, but a field name:
"bar": some value
So the advice for using YAML is to specify the desired structure on the application side – in SnakeYaml, you would set the root class type, and then every value will be mapped to the required type at its point in the hierarchy, as long as it is able to map there, regardless of whether it is quoted or unquoted. In general, it makes more sense for the application to specify which kind of value it expects throughout the hierarchy instead of the YAML author to do that via quoting. This is also conformant with the YAML spec, which says
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node, and (3) the content (and hence the kind) of the node.
Resolving a tag is the YAML term for determining the target type. And it is allowed to determine the target type based on its position in the hierarchy: The root type is determined by the fact that the element is the root of the YAML document and in the case of SnakeYaml, may be fed in via the API. All other types are determined by the fact that they are descendants from the root type.
Final note: If you really really want something to be a string, !!str 2018-04-17 will do since it sets a specific tag for the node.
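The two rules for non-specific tags quoted above (YAML 1.1 and YAML 1.2) can be put side by side in a short Python sketch. It only encodes the quoted sentences; non_specific_tag and its string arguments are hypothetical and do not correspond to any parser's API.

```python
def non_specific_tag(kind, style, version):
    """Return the non-specific tag for a node that has no explicit tag.

    kind is "scalar", "sequence" or "mapping"; style only matters for
    scalars, where "plain" means unquoted and not a block scalar.
    """
    if version == "1.1":
        # 1.1: "?" for plain scalars and "!" for all other nodes.
        plain_scalar = kind == "scalar" and style == "plain"
        return "?" if plain_scalar else "!"
    if version == "1.2":
        # 1.2: "!" for non-plain scalars and "?" for all other nodes.
        non_plain_scalar = kind == "scalar" and style != "plain"
        return "!" if non_plain_scalar else "?"
    raise ValueError(f"unsupported YAML version: {version}")


for version in ("1.1", "1.2"):
    print(version,
          non_specific_tag("scalar", "plain", version),
          non_specific_tag("scalar", "double-quoted", version),
          non_specific_tag("mapping", "block", version))
```

The versions agree on scalars and differ only for collections, which is why an untagged mapping may be resolved to an application class under 1.2 without contradicting the recommendation.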

Why yaml is popular? Is there anything else that does better. [duplicate]

What are the differences between YAML and JSON, specifically considering the following things?
Performance (encode/decode time)
Memory consumption
Expression clarity
Library availability, ease of use (I prefer C)
I was planning to use one of these two in our embedded system to store configure files.
Related:
Should I use YAML or JSON to store my Perl data?
Technically YAML is a superset of JSON. This means that, in theory at least, a YAML parser can understand JSON, but not necessarily the other way around.
See the official specs, in the section entitled "YAML: Relation to JSON".
In general, there are certain things I like about YAML that are not available in JSON.
As @jdupont pointed out, YAML is visually easier to look at. In fact the YAML homepage is itself valid YAML, yet it is easy for a human to read.
YAML has the ability to reference other items within a YAML file using "anchors." Thus it can handle relational information as one might find in a MySQL database.
YAML is more robust about embedding other serialization formats such as JSON or XML within a YAML file.
In practice neither of these last two points will likely matter for things that you or I do, but in the long term, I think YAML will be a more robust and viable data serialization format.
Right now, AJAX and other web technologies tend to use JSON. YAML is currently being used more for offline data processes. For example, it is included by default in the C-based OpenCV computer vision package, whereas JSON is not.
You will find C libraries for both JSON and YAML. YAML's libraries tend to be newer, but I have had no trouble with them in the past. See for example yaml-cpp (which is a C++ library).
Differences:
YAML, depending on how you use it, can be more readable than JSON
JSON is often faster and is probably still interoperable with more systems
It's possible to write a "good enough" JSON parser very quickly
Duplicate keys, which are potentially valid JSON, are definitely invalid YAML.
YAML has a ton of features, including comments and relational anchors. YAML syntax is accordingly quite complex, and can be hard to understand.
It is possible to write recursive structures in yaml: {a: &b [*b]}, which will loop infinitely in some converters. Even with circular detection, a "yaml bomb" is still possible (see xml bomb).
Because there are no references, it is impossible to serialize complex structures with object references in JSON. YAML serialization can therefore be more efficient.
In some coding environments, the use of YAML can allow an attacker to execute arbitrary code.
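The points about recursive structures and references can be tried out with Python's standard json module alone. The sketch below builds the in-memory equivalent of {a: &b [*b]} by hand (no YAML parser is involved) and uses a hypothetical helper, try_dump, to show what happens when it is serialized to JSON.

```python
import json

# A list that contains itself: the same shape as the YAML value &b [*b].
cycle = []
cycle.append(cycle)
recursive = {"a": cycle}


def try_dump(value):
    """Serialize to JSON, or describe why that is not possible."""
    try:
        return json.dumps(value)
    except ValueError as error:
        return f"cannot serialize: {error}"


print(try_dump(recursive))

# A shared (non-circular) object is written out twice, and the link is lost.
shared = [1, 2]
restored = json.loads(try_dump({"x": shared, "y": shared}))
print(restored["x"] is restored["y"])
```

After the round trip the two lists are equal but no longer the same object, which is the information that YAML anchors and aliases can preserve.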
Observations:
Python programmers are generally big fans of YAML, because of the use of indentation, rather than bracketed syntax, to indicate levels.
Many programmers consider the attachment of "meaning" to indentation a poor choice.
If the data format will be leaving an application's environment, parsed within a UI, or sent in a messaging layer, JSON might be a better choice.
YAML can be used, directly, for complex tasks like grammar definitions, and is often a better choice than inventing a new language.
Bypassing esoteric theory
This answers the title rather than the details, since most people, like me, arrive from a Google search result and only read the title. I therefore felt it was necessary to explain things from a web developer's perspective.
YAML uses space indentation, which is familiar territory for Python developers.
JavaScript developers love JSON because it is a subset of JavaScript and can be directly interpreted and written inside JavaScript, along with using a shorthand way to declare JSON, requiring no double quotes in keys when using typical variable names without spaces.
There are a plethora of parsers that work very well in all languages for both YAML and JSON.
YAML's space format can be much easier to look at in many cases because the formatting requires a more human-readable approach.
YAML's form, while more compact and easier to look at, can be deceptively difficult to edit by hand if your editor does not make whitespace visible. Tabs are not spaces, which adds to the confusion if your editor does not convert your keystrokes into spaces.
JSON is much faster to serialize and deserialize because it has significantly fewer features than YAML to check for, which enables smaller and lighter code to process JSON.
A common misconception is that YAML needs less punctuation and is therefore always more compact than JSON. Whitespace is invisible, so it seems like there are fewer characters, but if you count the whitespace that YAML needs for proper indentation, you will find that YAML can require as many characters as minified JSON, or more, particularly for deeply nested data. JSON doesn't use whitespace to represent hierarchy or grouping and can be easily flattened, with unnecessary whitespace removed, for more compact transport.
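Rather than arguing about character counts in the abstract, you can measure them for your own data. The Python sketch below uses only the standard json module; the YAML text is written by hand for a small made-up dataset (no YAML library is involved), so treat it as an illustration of the method, not as a benchmark.

```python
import json

data = {
    "datasets": [
        {"id": "rate_variation", "type": "POINTS_2D"},
        {"id": "frequency_variation", "type": "POINTS_2D"},
    ]
}

# Hand-written YAML for the same data, with two-space indentation.
yaml_text = (
    "datasets:\n"
    "  - id: rate_variation\n"
    "    type: POINTS_2D\n"
    "  - id: frequency_variation\n"
    "    type: POINTS_2D\n"
)

compact_json = json.dumps(data, separators=(",", ":"))  # no optional whitespace
pretty_json = json.dumps(data, indent=2)                # human-readable layout

for label, text in [("yaml", yaml_text),
                    ("json (compact)", compact_json),
                    ("json (indent=2)", pretty_json)]:
    print(f"{label:16} {len(text):4} characters")
```

The outcome depends on nesting depth, key lengths and the amount of quoting, so it is worth running this on a representative sample of your own files.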
The Elephant in the room: The Internet itself
JavaScript dominates the web by a huge margin, and JavaScript developers, along with most popular web APIs, overwhelmingly prefer JSON as the data format. It is therefore difficult to argue for YAML over JSON in general web programming, as you will likely be outvoted in a team environment. In fact, many web programmers aren't even aware YAML exists, let alone consider using it.
If you are doing any web programming, JSON is the default way to go because no translation step is needed when working with JavaScript so then you must come up with a better argument to use YAML over JSON in that case.
This question is 6 years old, but strangely, none of the answers really addresses all four points (speed, memory, expressiveness, portability).
Speed
Obviously this is implementation-dependent, but because JSON is so widely used, and so easy to implement, it has tended to receive greater native support, and hence speed. Considering that YAML does everything that JSON does, plus a truckload more, it's likely that of any comparable implementations of both, the JSON one will be quicker.
However, given that a YAML file can be slightly smaller than its JSON counterpart (due to fewer " and , characters), it's possible that a highly optimised YAML parser might be quicker in exceptional circumstances.
Memory
Basically the same argument applies. It's hard to see why a YAML parser would ever be more memory efficient than a JSON parser, if they're representing the same data structure.
Expressiveness
As noted by others, Python programmers tend towards preferring YAML, JavaScript programmers towards JSON. I'll make these observations:
It's easy to memorise the entire syntax of JSON, and hence be very confident about understanding the meaning of any JSON file. YAML is not truly understandable by any human. The number of subtleties and edge cases is extreme.
Because few parsers implement the entire spec, it's even harder to be certain about the meaning of a given expression in a given context.
The lack of comments in JSON is, in practice, a real pain.
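The comment limitation is easy to reproduce with Python's standard `json` module. This is a small sketch; the config text and the `_comment` key convention are illustrative assumptions, not part of any standard.

```python
import json

# Hypothetical config text containing a "//" comment, which JSON does not allow.
commented = '{\n  // listen port\n  "port": 8080\n}'

def try_parse(text):
    """Return (data, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)

data, error = try_parse(commented)

# A common workaround: carry the note as ordinary data under an agreed key.
workaround, _ = try_parse('{"_comment": "listen port", "port": 8080}')

print(error)
```

The workaround parses, but the note is now part of the data and every consumer has to know to ignore it.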
Portability
It's hard to imagine a modern language without a JSON library. It's also hard to imagine a JSON parser implementing anything less than the full spec. YAML has widespread support, but is less ubiquitous than JSON, and each parser implements a different subset. Hence YAML files are less interoperable than you might think.
Summary
JSON is the winner for performance (if relevant) and interoperability. YAML is better for human-maintained files. HJSON is a decent compromise although with much reduced portability. JSON5 is a more reasonable compromise, with well-defined syntax.
Git and YAML
The other answers are good. Read those first. But I'll add one other reason to use YAML sometimes: git.
Increasingly, many programming projects use git repositories for distribution and archival. And, while a git repo's history can equally store JSON and YAML files, the "diff" method used for tracking and displaying changes to a file is line-oriented. Since YAML is forced to be line-oriented, any small changes in a YAML file are easier to see by a human.
It is true, of course, that JSON files can be "made pretty" by sorting the strings/keys and adding indentation. But this is not the default and I'm lazy.
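For reference, the "made pretty" step is short in Python. The sketch below uses only the standard library; the `canonical` helper and the sample settings are hypothetical names chosen for illustration.

```python
import difflib
import json

def canonical(data):
    """Serialize with sorted keys and one value per line so diffs stay small."""
    return json.dumps(data, sort_keys=True, indent=2) + "\n"

# Same settings written in a different key order, with one value changed.
old = canonical({"name": "demo", "retries": 3, "debug": False})
new = canonical({"debug": False, "retries": 5, "name": "demo"})

# Keep only the changed lines of a line-oriented diff, skipping the file headers.
diff = [
    line
    for line in difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

print("\n".join(diff))
```

With sorted keys and fixed indentation, only the line that actually changed shows up, which is roughly what YAML gives by default.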
Personally, I generally use JSON for system-to-system interaction. I often use YAML for config files, static files, and tracked files. (I also generally avoid adding YAML relational anchors. Life is too short to hunt down loops.)
Also, if speed and space are really a concern, I don't use either. You might want to look at BSON.
I find YAML to be easier on the eyes: fewer parentheses, quotation marks, etc. There is the annoyance of tabs in YAML, but one gets the hang of it.
In terms of performance/resources, I wouldn't expect big differences between the two.
Furthermore, we are talking about configuration files, so I wouldn't expect a high frequency of encode/decode activity, no?
Technically YAML offers a lot more than JSON (YAML v1.2 is a superset of JSON):
comments
anchors and inheritance - example of 3 identical items:
item1: &anchor_name
  name: Test
  title: Test title
item2: *anchor_name
item3:
  <<: *anchor_name
  # You may add extra stuff.
  ...
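To make the result concrete, here is a plain-Python model of the data a loader would typically build from the snippet above. It is a sketch under assumptions: no YAML library is used, whether an alias becomes the very same object depends on the loader, and the `<<` merge key is a YAML 1.1 convention that not every parser supports.

```python
# The mapping tagged with &anchor_name.
anchor = {"name": "Test", "title": "Test title"}

document = {
    "item1": anchor,        # the anchored node itself
    "item2": anchor,        # *anchor_name: an alias to the same node
    "item3": dict(anchor),  # <<: *anchor_name: keys merged into a new mapping
}

# All three items carry the same content.
print(document["item1"] == document["item2"] == document["item3"])
```

Because `item3` is a new mapping, extra keys can be added to it without affecting `item1` or `item2`.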
Most of the time people will not use those extra features and the main difference is that YAML uses indentation whilst JSON uses brackets. This makes YAML more concise and readable (for the trained eye).
Which one to choose?
YAML's extra features and concise notation make it a good choice for configuration files (non-user provided files).
JSON's limited features, wide support, and faster parsing make it a great choice for interoperability and user provided data.
If you don't need any features which YAML has and JSON doesn't, I would prefer JSON because it is very simple and is widely supported (has a lot of libraries in many languages). YAML is more complex and has less support. I don't think the parsing speed or memory use will be very much different, and maybe not a big part of your program's performance.
Benchmark results
Below are the results of a benchmark comparing YAML and JSON loading times in Python and Perl.
JSON is much faster, at the expense of some readability and of features such as comments.
Test method
100 sequential runs on a fast machine, average number of seconds
The dataset was a 3.44MB JSON file, containing movie data scraped from Wikipedia
https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json
Linked to from: https://github.com/jdorfman/awesome-json-datasets
Results
Python 3.8.3 timeit
JSON: 0.108
YAML CLoader: 3.684
YAML: 29.763
Perl 5.26.2 Benchmark::cmpthese
JSON XS: 0.107
YAML XS: 0.574
YAML Syck: 1.050
Perl 5.26.2 Dumbbench (Brian D Foy, excludes outliers)
JSON XS: 0.102
YAML XS: 0.514
YAML Syck: 1.027
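The benchmark script itself is not shown, so the following is only a sketch of how such a measurement could be set up with Python's `timeit`. The `average_load_seconds` helper and the synthetic `sample` data are assumptions made for illustration; they are not the code or the dataset behind the numbers above.

```python
import json
import timeit

def average_load_seconds(loader, text, runs=100):
    """Average seconds per call of loader(text) over `runs` sequential runs."""
    return timeit.timeit(lambda: loader(text), number=runs) / runs

# Small synthetic stand-in for the movie dataset (hypothetical fields).
sample = json.dumps(
    [{"title": f"Movie {i}", "year": 1900 + i % 120, "cast": ["A", "B"]} for i in range(1000)]
)

json_seconds = average_load_seconds(json.loads, sample, runs=10)
print(f"JSON: {json_seconds:.6f}")

# A YAML loader can be timed the same way by passing it as `loader`,
# for example PyYAML's yaml.safe_load (assumption: PyYAML is installed).
```

Timing every loader on the same input and the same number of runs keeps the figures comparable.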
From Arnaud Lauret's book “The Design of Web APIs”:
The JSON data format
JSON is a text data format based on how the JavaScript programming language describes data but is, despite its name, completely language-independent (see https://www.json.org/). Using JSON, you can describe objects containing unordered name/value pairs and also arrays or lists containing ordered values, as shown in this figure.
An object is delimited by curly braces ({}). A name is a quoted string ("name") and is separated from its value by a colon (:). A value can be a string like "value", a number like 1.23, a Boolean (true or false), the null value null, an object, or an array. An array is delimited by brackets ([]), and its values are separated by commas (,).
The JSON format is easily parsed using any programming language. It is also relatively easy to read and write. It is widely adopted for many uses such as databases, configuration files, and, of course, APIs.
YAML
YAML (YAML Ain’t Markup Language) is a human-friendly, data serialization format. Like JSON, YAML (http://yaml.org) is a key/value data format. The figure shows a comparison of the two.
Note the following points:
There are no double quotes (" ") around property names and values in YAML.
JSON’s structural curly braces ({}) and commas (,) are replaced by newlines and indentation in YAML.
Array brackets ([]) and commas (,) are replaced by dashes (-) and newlines in YAML.
Unlike JSON, YAML allows comments beginning with a hash mark (#).
It is relatively easy to convert one of those formats into the other. Be forewarned though, you will lose comments when converting a YAML document to JSON.
Since this question now features prominently when searching for YAML and JSON, it's worth noting one rarely-cited difference between the two: license. JSON purports to have a license which JSON users must adhere to (including the legally-ambiguous "shall be used for Good, not Evil"). YAML carries no such license claim, and that might be an important difference (to your lawyer, if not to you).
Sometimes you don't have to decide for one over the other.
In Go, for example, you can have both at the same time:
type Person struct {
Name string `json:"name" yaml:"name"`
Age int `json:"age" yaml:"age"`
}
I find both YAML and JSON to be very effective. Only two things really dictate which one I use. The first is what the language is most commonly used with. For example, if I'm using Java or JavaScript, I'll use JSON. For Java, I'll use its own objects, which are pretty much JSON but lacking some features, and convert them to JSON if I need to, or write JSON in the first place. I do that because it's common practice in Java and makes it easier for other Java developers to modify my code. The second is whether the program is using the file to remember attributes or is receiving instructions in the form of a config file. In the latter case I'll use YAML, because it's very easy for humans to read, has nice-looking syntax, and is very easy to modify, even if you have no idea how YAML works. The program then reads it and converts it to JSON, or whatever is preferred for that language.
In the end, it honestly doesn't matter. Both JSON and YAML are easily read by any experienced programmer.
If parsing speed is your main concern, store the data in JSON. I had to parse data from a location where the file was subject to modification by other users, so I used YAML because it is more readable than JSON.
And you can also add comments in the YAML file which can't be done in a JSON file.
JSON encodes six data types: Objects (mappings), Arrays, Strings, Numbers, Booleans and Null. It is extremely easy for a machine to parse and provides very little flexibility. The specification is about a page and a half.
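As a quick illustration, here is how the six types come out of Python's standard `json` decoder. The key names are arbitrary labels chosen for this sketch.

```python
import json

# One value of each JSON type, keyed by the type's name.
text = (
    '{"object": {"k": "v"}, "array": [1, 2], "string": "s", '
    '"number": 1.5, "boolean": true, "null": null}'
)

decoded = json.loads(text)

# Map each JSON type to the name of the Python type it decodes to.
type_names = {key: type(value).__name__ for key, value in decoded.items()}
print(type_names)
```

Every JSON type lands on a built-in Python type, which is part of why decoding JSON needs so little machinery.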
YAML allows the encoding of arbitrary Python data and other crazy crap (which leads to vulnerabilities when decoding it). It is hard to parse because it offers so much flexibility. The specification for YAML was 86 pages, the last time I checked. YAML syntax is obviously influenced by Python, but maybe they should have been a little more influenced by the Python philosophy on a few points: e.g. “there should be one—and preferably only one—obvious way to do it” and “simple is better than complex.”
The main benefit of YAML over JSON is that it’s easier for humans to read and edit, which makes it a natural choice for configuration files.
These days, I’m leaning towards TOML for configuration files. It’s not as pretty or as flexible as YAML, but it’s easier both for machines and humans to parse. The syntax is (almost) a superset of INI syntax, but it parses out to JSON-like data structures, adding only one additional type: the date type.

Resources