Use of --- in yaml - ruby

I came across this yaml document:
--- !ruby/object:MyClass
myint: 100
mystring: hello world
What does the line:
--- !ruby/object:MyClass
mean?

In YAML, --- is the end of directives marker.
A YAML document may begin with a number of YAML directives (currently, two directives are defined, %YAML and %TAG). Since a text node (for example) can also start with a % character, there needs to be a way to distinguish between directives and text. This is achieved using the end of directives marker --- which signals the end of the directives and the beginning of the document.
Since directives are allowed to be empty, --- can also serve as a document separator.
YAML also has an end of document marker, ... (three dots). However, it is not often used, because an end of directives marker / document separator also implies the end of the previous document. You need it if you want to have multiple documents with directives within the same stream, or when you want to indicate that a document is finished without necessarily starting a new one (e.g. in cases where significant time may pass between the end of one document and the start of another).
Many YAML emitters, and Psych is no exception, always emit an end of directives marker at the beginning of each document. This allows you to easily concatenate multiple documents into a single stream without doing any additional processing of the documents.
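As a rough illustration of the concatenation point, the sketch below dumps two values with Psych (through Ruby's standard yaml library) and reads the joined string back with YAML.load_stream. The variable names are made up for the example.

```ruby
require 'yaml'

# Each dump starts with the end of directives marker "---".
first  = YAML.dump({ 'name' => 'doc one' })
second = YAML.dump([1, 2, 3])

# Plain string concatenation yields a two-document stream.
stream = first + second

# load_stream returns one Ruby object per document in the stream.
documents = YAML.load_stream(stream)

puts stream
p documents
```

No extra separator has to be inserted between the two dumps, because each one already carries its own leading marker.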
The other half of that line, !ruby/object:MyClass, is a tag. A tag gives a type to the node that follows it. In YAML, every node has a type, even if it is implicit. You can also write the tag explicitly; for example, text nodes have the type (tag) !!str. This can be useful in certain circumstances, for example here:
!!str 2018-10-31
This tells YAML that 2018-10-31 is text, not a date.
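A small sketch of that difference in Ruby, assuming a reasonably recent Psych where YAML.safe_load accepts the permitted_classes keyword (Date has to be permitted explicitly for the untagged case):

```ruby
require 'date'
require 'yaml'

# Without a tag, the scalar is resolved implicitly as a date.
implicit = YAML.safe_load('2018-10-31', permitted_classes: [Date])

# With the explicit !!str tag, the same characters stay a string.
explicit = YAML.safe_load('!!str 2018-10-31')

p implicit.class
p explicit
```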
!ruby/object:MyClass is a tag used by Psych to indicate that the node is a serialized Ruby Object which is an instance of class MyClass. This way, when deserializing the document, Psych knows what class to instantiate and how to treat the node.
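To make the round trip concrete, here is a minimal sketch that loads the document from the question. The MyClass definition is an assumption (the question does not show it), and YAML.safe_load with permitted_classes is used because recent Psych versions refuse to instantiate arbitrary classes by default.

```ruby
require 'yaml'

# Hypothetical stand-in for the class named in the tag.
class MyClass
  attr_accessor :myint, :mystring
end

doc = <<~DOC
  --- !ruby/object:MyClass
  myint: 100
  mystring: hello world
DOC

# The tag tells Psych which class to instantiate; the mapping keys
# become instance variables on the new object.
obj = YAML.safe_load(doc, permitted_classes: [MyClass])

p obj.class
p obj.myint
p obj.mystring

# Dumping the object again writes the tag back out after the marker.
puts YAML.dump(obj)
```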

According to yaml.org, '---' indicates the start of a document. See https://yaml.org/spec/1.2/spec.html for the official specification.

Related

why --- (3 dashes/hyphen) in yaml file?

I just started using a YAML file instead of application.properties because it is more readable. I see that YAML files often start with ---. I googled and found the explanation below.
YAML uses three dashes (“---”) to separate directives from document
content. This also serves to signal the start of a document if no
directives are present.
Also, I tried a sample without --- and understood that it is not mandatory to have them.
I think I don't have a clear understanding of directive and document. Can anyone please explain with a simple example?
As you already found out, the three dashes --- are used to signal the start of a document, i.e.:
To signal the document start after directives, i.e., %YAML or %TAG lines according to the current spec. For example:
%YAML 1.2
%TAG !foo! !foo-types/
---
myKey: myValue
To signal the document start when you have multiple yaml documents in the same stream, e.g., a yaml file:
doc 1
---
doc 2
If doc 2 has some preceding directives, then we have to use three dots ... to indicate the end of doc 1 (and the start of potential directives preceding doc 2) to the parser. For example:
doc 1
...
%TAG !bar! !bar-types/
---
doc 2
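The two multi-document streams above can be checked from Ruby with YAML.load_stream, which returns one object per document. This is only a sketch; the %YAML directive is left out here because support for it depends on the underlying parser version.

```ruby
require 'yaml'

# Two documents separated by "---"; neither has directives.
separated = "doc 1\n---\ndoc 2\n"

# The second document has a %TAG directive, so "..." closes the first one.
with_directive = <<~STREAM
  doc 1
  ...
  %TAG !bar! !bar-types/
  ---
  doc 2
STREAM

first_result  = YAML.load_stream(separated)
second_result = YAML.load_stream(with_directive)

p first_result
p second_result
```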
The spec is good for yaml parser implementers. However, I find this article easier to read from a user perspective.
The three dashes are not mandatory if your YAML does not begin with a directive. If it does begin with a directive, you have to use them.
Let's take a look at the documentation
3.2.3.4. Directives
Each document may be associated with a set of directives. A directive has a name and an optional sequence of
parameters. Directives are instructions to the YAML processor, and
like all other presentation details are not reflected in the YAML
serialization tree or representation graph. This version of YAML
defines a two directives, “YAML” and “TAG”. All other directives are
reserved for future versions of YAML.
One example of this can also be found in the documentation for directive YAML
%YAML 1.3 # Attempt parsing
# with a warning
---
"foo"

How to parse USPTO XML files with Ruby and Nokogiri?

I've spent the whole day trying to figure out how to parse USPTO bulk XML files. I downloaded one of those files, unzipped it, and then ran:
Nokogiri::XML(File.open('ipg140513.xml'))
But it seems to load only the first element, not all of the patents (the file contains a few thousand).
What am I doing wrong?
The file you linked to, and presumably the others, are not valid XML because they do not have a root element. From Wikipedia:
Each XML document has exactly one single root element.
Nokogiri hints at this if you look at the errors (suggested by Arup Rakshit), as detailed in the documentation:
Nokogiri::XML(File.open("/Users/b/Downloads/ipg140513.xml")).errors # =>
# [
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
# ]
The file appears to be a concatenation of a series of valid XML files, each having a <us-patent-grant/> as its root element.
Fortunately, Nokogiri can handle this invalid XML if you process it as a document fragment. Try this:
Nokogiri::XML::DocumentFragment.parse(File.read('ipg140513.xml')).children.select { |element| element.name == 'us-patent-grant' }
The select chooses the root node of each concatenated document, ignoring the processing instructions and DTD declarations.
Alternately, you could pre-process the file and split it into its constituent, correctly-formatted documents. Parsing a 650MB document all at once is quite slow and memory intensive.
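One way to do that pre-processing is sketched below using only the standard library. It assumes that every constituent document starts with an <?xml declaration at the beginning of a line, which matches the first error Nokogiri reported. The helper name each_xml_document and the sample data are made up for the example.

```ruby
require 'stringio'

# Yields each concatenated XML document as its own string.
# Assumption: every document starts with "<?xml" at the beginning of a line.
def each_xml_document(io)
  buffer = +''
  io.each_line do |line|
    # A new declaration means the previous document is complete.
    if line.start_with?('<?xml') && !buffer.strip.empty?
      yield buffer
      buffer = +''
    end
    buffer << line
  end
  # Emit the last document, which has no declaration after it.
  yield buffer unless buffer.strip.empty?
end

# Small in-memory stand-in for the real file.
sample = StringIO.new(<<~XML)
  <?xml version="1.0" encoding="UTF-8"?>
  <us-patent-grant id="first"/>
  <?xml version="1.0" encoding="UTF-8"?>
  <us-patent-grant id="second"/>
XML

documents = []
each_xml_document(sample) { |xml| documents << xml }
puts documents.length
```

With the real file, you would pass an open File instead of the StringIO and hand each yielded string to Nokogiri::XML, so only one patent is held in memory at a time.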

How to remove '---' on top of a YAML file?

I am modifying a YAML file in Ruby. After I write back the modified YAML, I see a --- added on top of the file. How is this getting added and how do I get rid of it?
YAML spec says:
YAML uses three dashes (“---”) to separate directives from document content. This also serves to signal the start of a document if no directives are present.
Example:
# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey
# Team ranking
---
- Chicago Cubs
- St Louis Cardinals
So if you have multiple documents per YAML file, you have to separate them with three dashes. If you only have one document, you can remove or omit them (I have never had a problem with YAML in Ruby when the three dashes were missing). My guess as to why they are added when you dump your object is that the dumper is written "by the spec" and doesn't bother to implement such shortcuts (omitting the three dashes when there is only one document).
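If you do want to drop the marker, a simple approach is to strip it from the dumped string before writing it out. The sketch below assumes the top-level value is a mapping or sequence, so the marker sits on a line of its own. The config hash is made up for the example.

```ruby
require 'yaml'

config = { 'name' => 'example', 'ports' => [80, 443] }

# Psych puts the marker on the first line of the dump.
dumped = YAML.dump(config)

# Remove only a leading "---" line; the rest of the document is untouched.
without_marker = dumped.sub(/\A---\n/, '')

puts without_marker

# The stripped text still loads to the same data.
reloaded = YAML.safe_load(without_marker)
```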

Indenting a YAML sequence inside a mapping

Should the following be valid?
parent:
- child
- child
So what we have is a sequence of values inside a mapping.
The specific question is whether the indentation of the 2nd and 3rd lines is valid. Ruby's YAML.dump generated this code, but the YAML parser here rejects it, because the child lines are not indented.
i.e. it wants something like:
parent:
  - child
  - child
Who is right?
Looking at the YAML spec, it's certainly not obvious, and the line
The “-”, “?” and “:” characters used to denote block collection entries are perceived by people to be part of the indentation
doesn't help much.
Yes, that is legal YAML. The relevant text from the spec is here:
Since people perceive the “-” indicator as indentation, nested block sequences may be indented by one less space to compensate, except, of course, if nested inside another block sequence (block-out context vs. block-in context).
and the subsequent example 8.22:
sequence: !!seq
- entry
- !!seq
 - nested
mapping: !!map
 foo: bar
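A quick way to confirm this from Ruby is to load both layouts and compare the results, as in this small sketch:

```ruby
require 'yaml'

# Sequence entries at the same indentation as the parent key.
compact  = "parent:\n- child\n- child\n"

# Sequence entries indented under the parent key.
indented = "parent:\n  - child\n  - child\n"

compact_data  = YAML.safe_load(compact)
indented_data = YAML.safe_load(indented)

p compact_data
p indented_data

# Psych itself emits the compact layout when dumping.
puts YAML.dump({ 'parent' => %w[child child] })
```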

How can I make empty tags self-closing with Nokogiri?

I've created an XML template in ERB. I fill it in with data from a database during an export process.
In some cases, there is a null value, in which case an element may be empty, like this:
<someitem>
</someitem>
In that case, the client receiving the export wants it to be converted into a self-closing tag:
<someitem/>
I'm trying to see how to get Nokogiri to do this, but I don't see it yet. Does anybody know how to make empty XML tags self-closing with Nokogiri?
Update
A regex was sufficient to do what I specified above, but the client now also wants tags whose children are all empty to be self-closing. So this:
<someitem>
<subitem>
</subitem>
<subitem>
</subitem>
</someitem>
... should also be
<someitem/>
I think that this will require using Nokogiri.
Search for
<([^>]+)>\s*</\1>
and replace with
<\1/>
In Ruby:
result = subject.gsub(/<([^>]+)>\s*<\/\1>/, '<\1/>')
Explanation:
< # Match opening bracket
( # Match and remember...
[^>]+ # One or more characters except >
) # End of capturing group
> # Match closing bracket
\s* # Match optional whitespace & newlines
< # Match opening bracket
/ # Match /
\1 # Match the contents of the opening tag
> # Match closing bracket
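The sketch below runs that substitution on a few made-up inputs to show what it does and does not cover. The constant name EMPTY_ELEMENT is only for the example.

```ruby
# Same pattern as above, stored in a constant for reuse.
EMPTY_ELEMENT = /<([^>]+)>\s*<\/\1>/

simple     = "<root>\n<someitem>\n</someitem>\n<other>text</other>\n</root>"
nested     = "<someitem>\n<subitem>\n</subitem>\n</someitem>"
attributed = '<item id="1"></item>'

# Only the whitespace-only element is collapsed.
simple_result = simple.gsub(EMPTY_ELEMENT, '<\1/>')

# The inner element collapses, but the parent is left as it was.
nested_result = nested.gsub(EMPTY_ELEMENT, '<\1/>')

# The backreference includes the attributes, so the closing tag never matches.
attributed_result = attributed.gsub(EMPTY_ELEMENT, '<\1/>')

puts simple_result
puts nested_result
puts attributed_result
```

The second and third cases are left uncollapsed, which is why the nested requirement from the update needs a real parser rather than this regex.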
A couple questions:
<foo></foo> is the same as <foo />, so why worry about such a tiny detail? If it is syntactically significant because the text node between the two is a "\n", then put a test in your ERB template that checks for the value that would go there, and if it's not initialized output the self-closing tag instead? See "Yak shaving".
Why involve Nokogiri? You should be able to generate correct XML in ERB since you're in control of the template.
EDIT - Nokogiri's behavior is to not rewrite parsed XML unless it has to. I suspect you'd have to remove the node in question, then reinsert it as an empty node, to get Nokogiri to output what you want.
