YAML deserializer with position information? - yaml

Does anyone know of a YAML deserializer that can provide position information for the constructed objects?
I know how to deserialize a YAML file into a Java object; there are simple instructions at http://yamlbeans.sourceforge.net/.
However, I want to do some algorithmic validation on the deserialized object and report errors back to the user, pointing to the position in the YAML that caused the error.
Example:
=========YAML file==========
name: Nathan Sweet
age: 28
address: 4011 16th Ave S
=======JAVA class======
public class Contact {
    public String name;
    public int age;
    public String address;
}
Imagine I want to first load the YAML into the Contact class, then validate the address against some repository and report an error if it's invalid. Something like:
'Line 3 Column 9: The address does not match valid entry in the database'
The problem is, currently there is no way to get the position inside a deserialized object from YAML.
Anyone know a solution to this issue?

Most YAML parsers, if they keep any position information around at all, drop it while constructing the language-native objects.
In ruamel.yaml ¹, I keep more information around because I want to be able to round-trip with minimal loss of original layout (e.g. keeping comments and key order in mappings).
I don't keep information on individual key-value pairs, but I do on the "upper-left" position of a mapping². Because of the kept order of the mapping items you can give some rather nice feedback. Given an input file:
- name: anthon
  age: 53
  adres: Rijn en Schiekade 105
- name: Nathan Sweet
  age: 28
  address: 4011 16th Ave S
And a program that you call with the input file as argument:
#! /usr/bin/env python2.7
# coding: utf-8
# http://stackoverflow.com/questions/30677517/yaml-deserializer-with-position-information?noredirect=1#comment49491314_30677517
import sys
import ruamel.yaml
up_arrow = '↑'
def key_error(key, value, line, col, error, e='E'):
    print('E[{}:{}]: {}'.format(line, col, error))
    print('{}{}: {}'.format(' '*col, key, value))
    print('{}{}'.format(' '*(col), up_arrow))
    print('---')
def value_error(key, value, line, col, error, e='E'):
    # the value starts after the key, the colon and one space
    val_col = col + len(key) + 2
    print('{}[{}:{}]: {}'.format(e, line, val_col, error))
    print('{}{}: {}'.format(' '*col, key, value))
    print('{}{}'.format(' '*(val_col), up_arrow))
    print('---')
def value_warning(key, value, line, col, error):
    value_error(key, value, line, col, error, e='W')
class Contact(object):
    def __init__(self, vals):
        # vals.lc holds the position of the first key of the mapping;
        # key order is preserved, so the offset gives the line of each key
        for offset, k in enumerate(vals):
            self.check(k, vals[k], vals.lc.line+offset, vals.lc.col)
        for k in ['name', 'address', 'age']:
            if k not in vals:
                print('K[{}:{}]: {}'.format(
                    vals.lc.line+offset, vals.lc.col, "missing key: "+k
                ))
                print('---')
    def check(self, key, value, line, col):
        if key == 'name':
            if value[0].lower() == value[0]:
                value_error(key, value, line, col,
                            'value should start with uppercase')
        elif key == 'age':
            if value < 50:
                value_warning(key, value, line, col,
                              'probably too young for knowing ALGOL 60')
        elif key == 'address':
            pass
        else:
            key_error(key, value, line, col,
                      "unexpected key")
data = ruamel.yaml.load(open(sys.argv[1]), Loader=ruamel.yaml.RoundTripLoader)
for x in data:
    contact = Contact(x)
giving you E(rrors), W(arnings) and K(eys missing):
E[0:8]: value should start with uppercase
  name: anthon
        ↑
---
E[2:2]: unexpected key
  adres: Rijn en Schiekade 105
  ↑
---
K[2:2]: missing key: address
---
W[4:7]: probably too young for knowing ALGOL 60
  age: 28
       ↑
---
Which you should be able to parse in a calling program in any language to give feedback. The check method of course needs adjusting to your requirements. This is not as nice as being able to do that in the language the rest of your application is in, but it might be better than nothing.
In my experience handling the above format is certainly simpler than extending an existing (open source) YAML parser.
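As a rough illustration of how a calling program might consume the E/W/K report format shown above, here is a minimal sketch in Python. The names `Diagnostic`, `HEADER` and `parse_report` are made up for this example, and it assumes that only header lines start with `E[`, `W[` or `K[` in column 0 (context lines and the `---` separators are skipped).

```python
import re
from typing import List, NamedTuple


class Diagnostic(NamedTuple):
    kind: str     # 'E', 'W' or 'K'
    line: int     # zero-based, as printed by the checker
    col: int      # zero-based column
    message: str


# header lines look like: E[0:8]: value should start with uppercase
HEADER = re.compile(r'^([EWK])\[(\d+):(\d+)\]: (.*)$')


def parse_report(text: str) -> List[Diagnostic]:
    """Collect the header line of every record, ignoring context lines."""
    found = []
    for raw in text.splitlines():
        m = HEADER.match(raw)
        if m:
            kind, line, col, message = m.groups()
            found.append(Diagnostic(kind, int(line), int(col), message))
    return found


report = """\
E[0:8]: value should start with uppercase
  name: anthon
        ↑
---
K[2:2]: missing key: address
---
"""
for d in parse_report(report):
    # convert to one-based positions for display to a user
    print(f'{d.kind} line {d.line + 1}, column {d.col + 1}: {d.message}')
```

The same regular expression should translate directly to Java or any other language with regex support.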
¹ Disclaimer: I am the author of that package
² I want to use that kind of information at some point to preserve spurious newlines, inserted for readability

In python, you can readily write custom Dumper/Loader objects and use them to load (or dump) your yaml code. You can have these objects track the file/line info:
import yaml
from collections import OrderedDict
class YamlOrderedDict(OrderedDict):
    """
    An OrderedDict that was loaded from a yaml file, and is annotated
    with file/line info for reporting about errors in the source file
    """
    def _annotate(self, node):
        self._key_locs = {}
        self._value_locs = {}
        # node.value is a list of (key_node, value_node) pairs, in file order
        nodeiter = iter(node.value)
        for key in self:
            subnode = next(nodeiter)
            self._key_locs[key] = subnode[0].start_mark.name + ':' + \
                str(subnode[0].start_mark.line+1)
            self._value_locs[key] = subnode[1].start_mark.name + ':' + \
                str(subnode[1].start_mark.line+1)
    def key_loc(self, key):
        try:
            return self._key_locs[key]
        except (AttributeError, KeyError):
            return ''
    def value_loc(self, key):
        try:
            return self._value_locs[key]
        except (AttributeError, KeyError):
            return ''
# Use YamlOrderedDict objects for yaml maps instead of normal dict
yaml.add_representer(OrderedDict, lambda dumper, data:
                     dumper.represent_dict(data.items()))
yaml.add_representer(YamlOrderedDict, lambda dumper, data:
                     dumper.represent_dict(data.items()))
def _load_YamlOrderedDict(loader, node):
    rv = YamlOrderedDict(loader.construct_pairs(node))
    rv._annotate(node)
    return rv
yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _load_YamlOrderedDict)
Now when you read a yaml file, any mapping objects will be read as a YamlOrderedDict, which allows looking up the file location of keys in the mapping object. You can also add an iterator method like:
    def iter_with_lines(self):
        for key, val in self.items():
            yield (key, val, self.key_loc(key))
...and now you can write a loop like:
for key, value, location in obj.iter_with_lines():
    # iterate through the key/value pairs in a YamlOrderedDict, with
    # the source file location
    print(location, key, value)
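If column information is needed as well, the node marks that PyYAML keeps before construction carry both line and column. The following is a minimal sketch, assuming PyYAML is installed and the document is a single top-level mapping with scalar keys; the helper name `key_value_positions` is made up for this example.

```python
import yaml  # PyYAML


def key_value_positions(text):
    """Map each top-level key to the one-based (line, column) of its value."""
    # compose() stops before construction, so the nodes still carry marks
    root = yaml.compose(text, Loader=yaml.SafeLoader)
    positions = {}
    for key_node, value_node in root.value:
        mark = value_node.start_mark  # line and column are zero-based
        positions[key_node.value] = (mark.line + 1, mark.column + 1)
    return positions


doc = "name: Nathan Sweet\nage: 28\naddress: 4011 16th Ave S\n"
positions = key_value_positions(doc)
line, col = positions['address']
print(f'Line {line} Column {col}: The address does not match valid entry in the database')
```

Note that the marks are zero-based, so whether the address value is reported in column 9 or 10 depends on the convention you choose for the user-facing message.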

Related

PyYAML loader with duplicate keys

I'm using PyYAML to load a (large) YAML file that has duplicate keys. I would like to preserve all keys and modify the duplicate keys according to project needs. But PyYAML seems to silently overwrite earlier values with the last key, so I don't get a chance to modify them (loss of information), resulting in this dict: {'blocks':{'a':'b2:11 c2:22'}}
simple example YAML:
import yaml
given_str = '''
blocks:
  a:
    b1:1
    c1:2
  a:
    b2:11
    c2:22'''
p = yaml.load(given_str)
How can I load the YAML with duplicate keys so that I get a chance to recursively traverse it and modify keys as needed? I need to load the YAML and then transfer it into a database.
Assuming your input YAML has no merge keys ('<<'), no tags and no comments you want
to preserve, you can use the following:
import sys
import ruamel.yaml
from pathlib import Path
from typing import Any, Dict
from collections.abc import Hashable
from ruamel.yaml.constructor import ConstructorError
file_in = Path('input.yaml')
class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
    def construct_mapping(self, node, deep=False):
        """deep is True when creating an object/mapping recursively,
        in that case want the underlying elements available during construction
        """
        if not isinstance(node, ruamel.yaml.nodes.MappingNode):
            raise ConstructorError(
                None, None, f'expected a mapping node, but found {node.id!s}', node.start_mark,
            )
        total_mapping = self.yaml_base_dict_type()
        if getattr(node, 'merge', None) is not None:
            todo = [(node.merge, False), (node.value, False)]
        else:
            todo = [(node.value, True)]
        for values, check in todo:
            mapping: Dict[Any, Any] = self.yaml_base_dict_type()
            for key_node, value_node in values:
                # keys can be list -> deep
                key = self.construct_object(key_node, deep=True)
                # lists are not hashable, but tuples are
                if not isinstance(key, Hashable):
                    if isinstance(key, list):
                        key = tuple(key)
                if not isinstance(key, Hashable):
                    raise ConstructorError(
                        'while constructing a mapping',
                        node.start_mark,
                        'found unhashable key',
                        key_node.start_mark,
                    )
                value = self.construct_object(value_node, deep=deep)
                if key in mapping:
                    # duplicate: find the first unused key of the form <key>_undup_<n>
                    pat = key + '_undup_{}'
                    index = 0
                    while True:
                        nkey = pat.format(index)
                        if nkey not in mapping:
                            key = nkey
                            break
                        index += 1
                mapping[key] = value
            total_mapping.update(mapping)
        return total_mapping
yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False
yaml.Constructor = MyConstructor
data = yaml.load(file_in)
yaml.dump(data, sys.stdout)
which gives:
blocks:
  a: b1:1 c1:2
  a_undup_0: b2:11 c2:22
Please note that the values for both a keys are multiline plain scalars. For b1 and c1 to be a key
the mapping value indicator (:, the colon) needs to be followed by a whitespace character:
a:
  b1: 1
  c1: 2
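The renaming step from the constructor above can be looked at in isolation, without any YAML library. This is only a sketch of that one idea: the function name `rename_duplicates` and its `pattern` parameter are invented for the example, and it works on a plain list of key-value pairs.

```python
def rename_duplicates(pairs, pattern='{}_undup_{}'):
    """Build a dict from (key, value) pairs, renaming repeated keys
    to <key>_undup_0, <key>_undup_1, ... instead of overwriting them."""
    mapping = {}
    for key, value in pairs:
        if key in mapping:
            index = 0
            # find the first suffix that is not taken yet
            while pattern.format(key, index) in mapping:
                index += 1
            key = pattern.format(key, index)
        mapping[key] = value
    return mapping


pairs = [('a', 'b1:1 c1:2'), ('a', 'b2:11 c2:22'), ('a', 'third')]
result = rename_duplicates(pairs)
print(result)
```

A renamed key could still collide with a key of the same name that appears later in the input; whether that matters depends on your data.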
After reading many forums, I think the best solution is to create a wrapper around the YAML loader (removing duplicates). @Anthon, any comment?
import yaml
from collections import defaultdict, Counter
####### Preserving Duplicate ###################
def parse_preserving_duplicates(input_file):
    class PreserveDuplicatesLoader(yaml.CLoader):
        pass
    def map_constructor(loader, node, deep=False):
        """Walk tree, removing degeneracy in any duplicate keys"""
        keys = [loader.construct_object(node, deep=deep) for node, _ in node.value]
        vals = [loader.construct_object(node, deep=deep) for _, node in node.value]
        key_count = Counter(keys)
        data = defaultdict(dict)  # map all data removing duplicates
        c = 0
        for key, value in zip(keys, vals):
            if key_count[key] > 1:
                # duplicate key: append a running counter to make it unique
                data[f'{key}{c}'] = value
                c += 1
            else:
                data[key] = value
        return data
    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                             map_constructor)
    return yaml.load(input_file, PreserveDuplicatesLoader)
##########################################################
# inputf is the path of the YAML file to load
with open(inputf, 'r') as file:
    fp = parse_preserving_duplicates(file)

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I, and where can I, adjust the flow_style based on whether a flow-style map would trigger line wrapping?
I think the most accurate approach, when you encounter a flow-style mapping in the dumping process, is to first try to emit it to a buffer, then get the length of the buffer, and if that length combined with the column you are in exceeds the maximum width, actually emit block style.
Any attempt to guesstimate the length of the output without actually trying to write that part of a tree is going to be hard, if not impossible, to do without doing the actual emit. Among other things, the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ([1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]). So a simple calculation of the length of the scalars that are the keys and values, plus the separating commas, colons and spaces, is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented changes all mappings/sequences that are too long
that are on one line.
import sys
import ruamel.yaml
yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
  I have seen things you people wouldn't believe.,
  Attack ships on fire off the shoulder of Orion.,
  I watched C-beams glitter in the dark near the Tannhäuser Gate.,
]}
"""
class Blockify:
    def __init__(self, width, only_first=False, verbose=0):
        self._width = width
        self._yaml = None
        self._only_first = only_first
        self._verbose = verbose
    @property
    def yaml(self):
        if self._yaml is None:
            self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
            y.preserve_quotes = True
            y.width = 2**16  # dump with a very large width, so nothing gets wrapped
        return self._yaml
    def __call__(self, d):
        pass_nr = 0
        changed = [True]
        while changed[0]:
            changed[0] = False
            try:
                s = self.yaml.dumps(d)
            except AttributeError:
                print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
                sys.exit(1)
            if self._verbose > 1:
                print(s)
            # collect the numbers of the lines that exceed the wanted width
            too_long = set()
            max_ll = -1
            for line_nr, line in enumerate(s.splitlines()):
                if len(line) > self._width:
                    too_long.add(line_nr)
                if len(line) > max_ll:
                    max_ll = len(line)
            if self._verbose > 0:
                print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
                sys.stdout.flush()
            new_d = self.yaml.load(s)
            self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
            d = new_d
            pass_nr += 1
        return d, s
    @staticmethod
    def change_to_block(d, too_long, changed, only_first):
        if isinstance(d, dict):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            # don't change keys if any value is changed
            for v in d.values():
                Blockify.change_to_block(v, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
            if changed[0]:  # don't change keys if value has changed
                return
            for k in d:
                Blockify.change_to_block(k, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
        if isinstance(d, (list, tuple)):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            for elem in d:
                Blockify.change_to_block(elem, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
blockify = Blockify(96, verbose=2)  # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output)  # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
  [Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] will never be in block style,
because the tuple subclass CommentedKeySeq never gets a line number attached.

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress, i.e. work that was started but is unfinished. Its internals will almost certainly change.
Two of the three points you indicate can be implemented relatively easily:
set the column of each comment to 0 (by recursively going over a loaded data structure, similar to here). If the column is before the position of the end of the value on a line, you'll get one space between the value and the comment
while doing the recursion from the previous point, take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines())
while '\n\n\n' in value:
    value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult: it depends on the data structure etc. Given that these are full line comments, I suggest you do some postprocessing of the YAML document you generate:
find a full line comment, and gather full line comments until the next line is not a full line comment (i.e. some "real" YAML). Since full line comments are in column 0 if the previous steps are applied, you don't have to track whether you are in a (multi-line) literal or folded scalar string where one of the lines happens to start with #
determine the number of spaces before the real YAML and apply these to the full line comments.
import sys
import ruamel.yaml
yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
  # collapse multiple empty lines
  c: |
    # this is not a comment
    it is the first line of a block style literal scalar
    processing this gobbles a newline which doesn't go into a comment
  # that is unless you have a (dedented) comment directly following
  d: 42 # and some non-full line comment
  e: # another one
  # and some more comments to align
  f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
    def do_one(comment):
        if not comment:
            return
        comment.column = 0
        value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
        while '\n\n\n' in value:
            value = value.replace('\n\n\n', '\n\n')
        comment.value = value
    def do_values(v):
        for x in v:
            for comment in x:
                do_one(comment)
    def do_loc(v):
        if v is None:
            return
        do_one(v[0])
        if not v[1]:
            return
        for comment in v[1]:
            do_one(comment)
    if isinstance(d, dict):
        do_loc(d.ca.comment)
        do_values(d.ca.items.values())
        for val in d.values():
            redo_comments(val)
    elif isinstance(d, list):
        do_values(d.ca.items.values())
        for elem in d:
            redo_comments(elem)
def realign_full_line_comments(s):
    res = []
    buf = []  # full line comments (and empty lines) waiting for the next YAML line
    for line in s.splitlines(True):
        if not buf:
            if line and line[0] == '#':
                buf.append(line)
            else:
                res.append(line)
        else:
            if line[0] in '#\n':
                buf.append(line)
            else:
                # YAML line, determine indent
                count = 0
                while line[count] == ' ':
                    count += 1
                    if count > len(line):
                        break  # superfluous?
                indent = ' ' * count
                for cline in buf:
                    if cline[0] == '\n':  # empty
                        res.append(cline)
                    else:
                        res.append(indent + cline)
                buf = []
                res.append(line)
    return ''.join(res)
yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
  # collapse multiple empty lines
  c: |
    # this is not a comment
    it is the first line of a block style literal scalar
    processing this gobbles a newline which doesn't go into a comment
  # that is unless you have a (dedented) comment directly following
  d: 42 # and some non-full line comment
  e: # another one
  # and some more comments to align
  f: glitter in the dark near the Tannhäuser gate

using construct_undefined in ruamel from_yaml

I'm creating a custom yaml tag MyTag. It can contain any given valid yaml - map, scalar, anchor, sequence etc.
How do I implement class MyTag to model this tag so that ruamel parses the contents of a !mytag in exactly the same way as it would parse any given yaml? The MyTag instance just stores whatever the parsed result of the yaml contents is.
The following code works, and the asserts should demonstrate exactly what it should do; they all pass.
But I'm not sure if it's working for the right reasons. Specifically, in the from_yaml class method, is using commented_obj = constructor.construct_undefined(node) a recommended way of achieving this, and is consuming one and only one item from the yielded generator correct? Is it not just working by accident?
Should I instead be using something like construct_object, or construct_map, or something else? The examples I've been able to find tend to know what type they are constructing, so they use either construct_map or construct_sequence to pick which type of object to construct. In this case I effectively want to piggy-back off the standard ruamel parsing for whatever unknown type there might be in there, and just store it in its own type.
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq, TaggedScalar
class MyTag():
    yaml_tag = '!mytag'
    def __init__(self, value):
        self.value = value
    @classmethod
    def from_yaml(cls, constructor, node):
        commented_obj = constructor.construct_undefined(node)
        flag = False
        for data in commented_obj:
            if flag:
                raise AssertionError('should only be 1 thing in generator??')
            flag = True
        return cls(data)
with open('mytag-sample.yaml') as yaml_file:
    yaml_parser = ruamel.yaml.YAML()
    yaml_parser.register_class(MyTag)
    yaml = yaml_parser.load(yaml_file)
custom_tag_with_list = yaml['root'][0]['arb']['k2']
assert type(custom_tag_with_list) is MyTag
assert type(custom_tag_with_list.value) is CommentedSeq
print(custom_tag_with_list.value)
standard_list = yaml['root'][0]['arb']['k3']
assert type(standard_list) is CommentedSeq
assert standard_list == custom_tag_with_list.value
custom_tag_with_map = yaml['root'][1]['arb']
assert type(custom_tag_with_map) is MyTag
assert type(custom_tag_with_map.value) is CommentedMap
print(custom_tag_with_map.value)
standard_map = yaml['root'][1]['arb_no_tag']
assert type(standard_map) is CommentedMap
assert standard_map == custom_tag_with_map.value
custom_tag_scalar = yaml['root'][2]
assert type(custom_tag_scalar) is MyTag
assert type(custom_tag_scalar.value) is TaggedScalar
standard_tag_scalar = yaml['root'][3]
assert type(standard_tag_scalar) is str
assert standard_tag_scalar == str(custom_tag_scalar.value)
And some sample yaml:
root:
  - item: blah
    arb:
      k1: v1
      k2: !mytag
        - one
        - two
        - three-k1: three-v1
          three-k2: three-v2
          three-k3: 123 # arb comment
          three-k4:
            - a
            - b
            - True
      k3:
        - one
        - two
        - three-k1: three-v1
          three-k2: three-v2
          three-k3: 123 # arb comment
          three-k4:
            - a
            - b
            - True
  - item: argh
    arb: !mytag
      k1: v1
      k2: 123
      # blah line 1
      # blah line 2
      k3:
        k31: v31
        k32:
          - False
          - string here
          - 321
    arb_no_tag:
      k1: v1
      k2: 123
      # blah line 1
      # blah line 2
      k3:
        k31: v31
        k32:
          - False
          - string here
          - 321
  - !mytag plain scalar
  - plain scalar
  - item: no comment
    arb:
      - one1
      - two2
In YAML you can have anchors and aliases, and it is perfectly fine to have an object be a child of itself (using an alias). If you want to dump the Python data structure data:
data = [1, 2, 4, dict(a=42)]
data[3]['b'] = data
it dumps to:
&id001
- 1
- 2
- 4
- a: 42
  b: *id001
and for that anchors and aliases are necessary.
When loading such a construct, ruamel.yaml recurses into the nested data structures, but if the toplevel node has not caused a real object to be constructed to which the anchor can be made a reference, the recursive leaf cannot resolve the alias.
To solve that, a generator is used, except for scalar values. It first creates an empty object, then recurses and updates its values. In the code calling the constructor a check is made to see if a generator is returned, and in that case next() is done on the data, and potential self-recursion is "resolved".
Because you call construct_undefined(), you always get a generator. Practically, that method could return a value if it detects a scalar node (which of course cannot recurse), but it doesn't. If it did, your code could then not load the following YAML document:
!mytag 1
without modifications that test if you get a generator or not, as is done in the code in ruamel.yaml calling the various constructors so it can handle both construct_undefined and e.g. construct_yaml_int (which is not a generator).
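The two-step idea can be illustrated with a toy model that does not use ruamel.yaml at all. Everything in this sketch is hypothetical: nodes are plain dicts with invented 'anchor', 'alias' and 'items' keys, and `construct` / `construct_seq` are made-up names. It only shows why yielding the empty object first lets an alias inside the object refer back to it.

```python
def construct_seq(node, anchors):
    """Two-step construction: yield the empty list first, fill it afterwards."""
    data = []
    yield data  # the caller can now register the anchor for this object
    for item in node['items']:
        data.append(construct(item, anchors))


def construct(node, anchors):
    if isinstance(node, dict) and 'alias' in node:
        # works for self-references, because the anchor is already registered
        return anchors[node['alias']]
    if isinstance(node, dict) and 'items' in node:
        gen = construct_seq(node, anchors)
        data = next(gen)  # step 1: get the still-empty object
        if node.get('anchor'):
            anchors[node['anchor']] = data
        for _ in gen:  # step 2: run the generator to completion, filling data
            pass
        return data
    return node  # scalars are returned directly, no generator involved


# toy equivalent of:  &id001 [1, 2, 4, *id001]
tree = {'anchor': 'id001', 'items': [1, 2, 4, {'alias': 'id001'}]}
result = construct(tree, {})
print(result[:3], result[3] is result)
```

The loop that exhausts the generator plays the same role as the for loop in from_yaml in the question: the first item is the object, and running the generator to its end is what fills it.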

YAML mapping order not preserved when using alias and yamlordereddictloader loader

I want to load a YAML file into Python as an OrderedDict. I am using yamlordereddictloader to preserve ordering.
However, I notice that the aliased object is placed "too soon" in the OrderedDict in the output.
How can I preserve the order of this mapping when read into Python, ideally as an OrderedDict? Is it possible to achieve this result without writing some custom parsing?
Notes:
I'm not particularly concerned with the method used, as long as the end result is the same.
Using sequences instead of mappings is problematic because they can result in nested output, and I can't simply flatten everything (some nestedness is appropriate).
When I try to just use !!omap, I cannot seem to merge the aliased mapping (d1.dt) into the d2 mapping.
I'm in Python 3.6, if I don't use this loader or !!omap order is not preserved (apparently contrary to the top 'Update' here: https://stackoverflow.com/a/21912744/2343633)
import yaml
import yamlordereddictloader
yaml_file = """
d1:
  id:
    nm1: val1
  dt: &dt
    nm2: val2
    nm3: val3
d2: # expect nm4, nm2, nm3
  nm4: val4
  <<: *dt
"""
out = yaml.load(yaml_file, Loader=yamlordereddictloader.Loader)
keys = [x for x in out['d2']]
print(keys) # ['nm2', 'nm3', 'nm4']
assert keys==['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
Is it possible to achieve this result without writing some custom parsing?
Yes. You need to override the method flatten_mapping from SafeConstructor. Here's a basic working example:
import yaml
import yamlordereddictloader
from yaml.constructor import *
from yaml.reader import *
from yaml.parser import *
from yaml.resolver import *
from yaml.composer import *
from yaml.scanner import *
from yaml.nodes import *
class MyLoader(yamlordereddictloader.Loader):
    def __init__(self, stream):
        yamlordereddictloader.Loader.__init__(self, stream)
    # taken from here and reengineered to keep order:
    # https://github.com/yaml/pyyaml/blob/5.3.1/lib/yaml/constructor.py#L207
    def flatten_mapping(self, node):
        merged = []
        def merge_from(node):
            if not isinstance(node, MappingNode):
                # ConstructorError comes from the yaml.constructor star import
                raise ConstructorError("while constructing a mapping",
                    node.start_mark, "expected mapping for merging, but found %s" %
                    node.id, node.start_mark)
            self.flatten_mapping(node)
            merged.extend(node.value)
        for index in range(len(node.value)):
            key_node, value_node = node.value[index]
            if key_node.tag == u'tag:yaml.org,2002:merge':
                # insert the merged pairs at the position of the << key
                if isinstance(value_node, SequenceNode):
                    for subnode in value_node.value:
                        merge_from(subnode)
                else:
                    merge_from(value_node)
            else:
                if key_node.tag == u'tag:yaml.org,2002:value':
                    key_node.tag = u'tag:yaml.org,2002:str'
                merged.append((key_node, value_node))
        node.value = merged
yaml_file = """
d1:
  id:
    nm1: val1
  dt: &dt
    nm2: val2
    nm3: val3
d2: # expect nm4, nm2, nm3
  nm4: val4
  <<: *dt
"""
out = yaml.load(yaml_file, Loader=MyLoader)
keys = [x for x in out['d2']]
print(keys)
assert keys==['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
This does not have the best performance, as it basically copies all key-value pairs from all mappings once each during loading, but it works. Performance enhancement is left as an exercise for the reader :).

Resources