Keep indentation for array value in parsed YAML frontmatter - ruby

I am using https://github.com/waiting-for-dev/front_matter_parser to parse and update values in the frontmatter of my markdown files.
The following code removes the original indentation of two spaces for array values:
require 'front_matter_parser'
class FrontMatterUpdater
def self.run(path)
parsed = FrontMatterParser::Parser.parse_file(path)
front_matter = parsed.front_matter
front_matter['redirect_from'] = Array(front_matter['redirect_from'])
front_matter['redirect_from'] << 'new_entry2'
front_matter['redirect_from'].uniq!
new_content = [YAML.dump(front_matter), '---', "\n\n", parsed.content].join
File.write(path, new_content)
end
end
FrontMatterUpdater.run('test.md')
Content of my test.md file (same path):.
---
redirect_from:
- new_entry0
- new_entry1
---
Script result (indentation has been removed by the parser):
---
redirect_from:
- new_entry0
- new_entry1
- new_entry2
---
YAML indentation for array in hash confirms that both versions (indented and not) is valid YAML syntax.
But I'd love to keep the indentation for readability.
Do you see any option to keep the indentation of 2 spaces for the array values?

Related

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress cq work started but unfinished. It's internals will almost certainly change.
Two of the three points you indicate can relatively easy be implemented:
set the column of each comment to 0 ( by recursively going over a loaded data structure similar to here ) if the column is before the position of the end of the value on a line, you'll get one space between the value and the column
at the same time doing the recursion in the previous point. Take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines()
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult, depends on the
data structure etc. Given that these are full line comments, I suggest
you do some postprocessing of the YAML document you generate:
find a full line comment, gather full line comments until next line is
not full line comment (i.e. some "real" YAML). Since full line comments
are in column[0] if the previous stuff is applied, you don't have to
track if you are in a (multi-line) literal or folded scalar string where
one of the lines happens to start with #
determine number of spaces
before real YAML and apply these to the full line comments.
import sys
import ruamel.yaml
yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
def do_one(comment):
if not comment:
return
comment.column = 0
value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
comment.value = value
def do_values(v):
for x in v:
for comment in x:
do_one(comment)
def do_loc(v):
if v is None:
return
do_one(v[0])
if not v[1]:
return
for comment in v[1]:
do_one(comment)
if isinstance(d, dict):
do_loc(d.ca.comment)
do_values(d.ca.items.values())
for val in d.values():
redo_comments(val)
elif isinstance(d, list):
do_values(d.ca.items.values())
for elem in d:
redo_comments(elem)
def realign_full_line_comments(s):
res = []
buf = []
for line in s.splitlines(True):
if not buf:
if line and line[0] == '#':
buf.append(line)
else:
res.append(line)
else:
if line[0] in '#\n':
buf.append(line)
else:
# YAML line, determine indent
count = 0
while line[count] == ' ':
count += 1
if count > len(line):
break # superfluous?
indent = ' ' * count
for cline in buf:
if cline[0] == '\n': # empty
res.append(cline)
else:
res.append(indent + cline)
buf = []
res.append(line)
return ''.join(res)
yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate

pyparsing: match words on same line

In pyparsing I'm looking for a simple way to match words (or other expressions) that occur on the same line, i.e. without any newline in between them.
You can override the default whitespace-skipping characters for a particular parser element - in this case, the word_on_the_same_line only skips spaces, but not newlines.
import pyparsing as pp
word = pp.Word(pp.alphas, pp.alphanums)
# define special whitespace skipping, so that newlines aren't
# skipped when matching a word_on_the_same_line
word_on_the_same_line = word().setWhitespaceChars(" ")
# compare results with this version of word_on_the_same_line to see
# how pyparsing treats newlines as skippable whitespace
# word_on_the_same_line = word()
line = pp.Group(word("key") + word_on_the_same_line[...]("values"))
test = """\
key1 lsdkjf lskdjf lskjdf sldkjf
key2 sdlkjf lskdj lkjss lsdj
"""
print(line[...].parseString(test).dump())
Prints:
[['key1', 'lsdkjf', 'lskdjf', 'lskjdf', 'sldkjf'], ['key2', 'sdlkjf', 'lskdj', 'lkjss', 'lsdj']]
[0]:
['key1', 'lsdkjf', 'lskdjf', 'lskjdf', 'sldkjf']
- key: 'key1'
- values: ['lsdkjf', 'lskdjf', 'lskjdf', 'sldkjf']
[1]:
['key2', 'sdlkjf', 'lskdj', 'lkjss', 'lsdj']
- key: 'key2'
- values: ['sdlkjf', 'lskdj', 'lkjss', 'lsdj']

using construct_undefined in ruamel from_yaml

I'm creating a custom yaml tag MyTag. It can contain any given valid yaml - map, scalar, anchor, sequence etc.
How do I implement class MyTag to model this tag so that ruamel parses the contents of a !mytag in exactly the same way as it would parse any given yaml? The MyTag instance just stores whatever the parsed result of the yaml contents is.
The following code works, and the asserts should should demonstrate exactly what it should do and they all pass.
But I'm not sure if it's working for the right reasons. . . Specifically in the from_yaml class method, is using commented_obj = constructor.construct_undefined(node) a recommended way of achieving this, and is consuming 1 and only 1 from the yielded generator correct? It's not just working by accident?
Should I instead be using something like construct_object, or construct_map or. . .? The examples I've been able to find tend to know what type it is constructing, so would either use construct_map or construct_sequence to pick which type of object to construct. In this case I effectively want to piggy-back of the usual/standard ruamel parsing for whatever unknown type there might be in there, and just store it in its own type.
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq, TaggedScalar
class MyTag():
yaml_tag = '!mytag'
def __init__(self, value):
self.value = value
#classmethod
def from_yaml(cls, constructor, node):
commented_obj = constructor.construct_undefined(node)
flag = False
for data in commented_obj:
if flag:
raise AssertionError('should only be 1 thing in generator??')
flag = True
return cls(data)
with open('mytag-sample.yaml') as yaml_file:
yaml_parser = ruamel.yaml.YAML()
yaml_parser.register_class(MyTag)
yaml = yaml_parser.load(yaml_file)
custom_tag_with_list = yaml['root'][0]['arb']['k2']
assert type(custom_tag_with_list) is MyTag
assert type(custom_tag_with_list.value) is CommentedSeq
print(custom_tag_with_list.value)
standard_list = yaml['root'][0]['arb']['k3']
assert type(standard_list) is CommentedSeq
assert standard_list == custom_tag_with_list.value
custom_tag_with_map = yaml['root'][1]['arb']
assert type(custom_tag_with_map) is MyTag
assert type(custom_tag_with_map.value) is CommentedMap
print(custom_tag_with_map.value)
standard_map = yaml['root'][1]['arb_no_tag']
assert type(standard_map) is CommentedMap
assert standard_map == custom_tag_with_map.value
custom_tag_scalar = yaml['root'][2]
assert type(custom_tag_scalar) is MyTag
assert type(custom_tag_scalar.value) is TaggedScalar
standard_tag_scalar = yaml['root'][3]
assert type(standard_tag_scalar) is str
assert standard_tag_scalar == str(custom_tag_scalar.value)
And some sample yaml:
root:
- item: blah
arb:
k1: v1
k2: !mytag
- one
- two
- three-k1: three-v1
three-k2: three-v2
three-k3: 123 # arb comment
three-k4:
- a
- b
- True
k3:
- one
- two
- three-k1: three-v1
three-k2: three-v2
three-k3: 123 # arb comment
three-k4:
- a
- b
- True
- item: argh
arb: !mytag
k1: v1
k2: 123
# blah line 1
# blah line 2
k3:
k31: v31
k32:
- False
- string here
- 321
arb_no_tag:
k1: v1
k2: 123
# blah line 1
# blah line 2
k3:
k31: v31
k32:
- False
- string here
- 321
- !mytag plain scalar
- plain scalar
- item: no comment
arb:
- one1
- two2
In YAML you can have anchors and aliases, and it is perfectly fine to have an object be a child of itself (using an alias). If you want to dump the Python data structure data:
data = [1, 2, 4, dict(a=42)]
data[3]['b'] = data
it dumps to:
&id001
- 1
- 2
- 4
- a: 42
b: *id001
and for that anchors and aliases are necessary.
When loading such a construct, ruamel.yaml recurses into the nested data structures, but if the toplevel node has not caused a real object to be constructed to which the anchor can be made a reference, the recursive leaf cannot resolve the alias.
To solve that, a generator is used, except for scalar values. It first creates an empty object, then recurses and updates it values. In code calling the constructor a check is made to see if a generator is returned, and in that case next() is done on the data, and potential self-recursion "resolved".
Because you call construct_undefined(), you always get a generator. Practically that method could return a value if it detects a scalar node (which of course cannot recurse), but it doesn't. If it would, your code could then not load the following YAML document:
!mytag 1
without modifications that test if you get a generator or not, as is done in the code in ruamel.yaml calling the various constructors so it can handle both construct_undefined and e.g. construct_yaml_int (which is not a generator).

Sort two text files with its indented text aligned to it

I would like to compare two of my log files generated before and after an implementation to see if it has impacted anything. However, the order of the logs I get is not the same all the time. Since, the log file also has multiple indented lines, when I tried to sort, everything is sorted. But, I would like to keep the child intact with the parent. Indented lines are spaces and not tab.
Any help would be greatly appreciated. I am fine with any windows solution or Linux one.
Eg of the file:
#This is a sample code
Parent1 to be verified
Child1 to be verified
Child2 to be verified
Child21 to be verified
Child23 to be verified
Child22 to be verified
Child221 to be verified
Child4 to be verified
Child5 to be verified
Child53 to be verified
Child52 to be verified
Child522 to be verified
Child521 to be verified
Child3 to be verified
I am posting another answer here to sort it hierarchically, using python.
The idea is to attach the parents to the children to make sure that the children under the same parent are sorted together.
See the python script below:
"""Attach parent to children in an indentation-structured text"""
from typing import Tuple, List
import sys
# A unique separator to separate the parent and child in each line
SEPARATOR = '#'
# The indentation
INDENT = ' '
def parse_line(line: str) -> Tuple[int, str]:
"""Parse a line into indentation level and its content
with indentation stripped
Args:
line (str): One of the lines from the input file, with newline ending
Returns:
Tuple[int, str]: The indentation level and the content with
indentation stripped.
Raises:
ValueError: If the line is incorrectly indented.
"""
# strip the leading white spaces
lstripped_line = line.lstrip()
# get the indentation
indent = line[:-len(lstripped_line)]
# Let's check if the indentation is correct
# meaning it should be N * INDENT
n = len(indent) // len(INDENT)
if INDENT * n != indent:
raise ValueError(f"Wrong indentation of line: {line}")
return n, lstripped_line.rstrip('\r\n')
def format_text(txtfile: str) -> List[str]:
"""Format the text file by attaching the parent to it children
Args:
txtfile (str): The text file
Returns:
List[str]: A list of formatted lines
"""
formatted = []
par_indent = par_line = None
with open(txtfile) as ftxt:
for line in ftxt:
# get the indentation level and line without indentation
indent, line_noindent = parse_line(line)
# level 1 parents
if indent == 0:
par_indent = indent
par_line = line_noindent
formatted.append(line_noindent)
# children
elif indent > par_indent:
formatted.append(par_line +
SEPARATOR * (indent - par_indent) +
line_noindent)
par_indent = indent
par_line = par_line + SEPARATOR + line_noindent
# siblings or dedentation
else:
# We just need first `indent` parts of parent line as our prefix
prefix = SEPARATOR.join(par_line.split(SEPARATOR)[:indent])
formatted.append(prefix + SEPARATOR + line_noindent)
par_indent = indent
par_line = prefix + SEPARATOR + line_noindent
return formatted
def sort_and_revert(lines: List[str]):
"""Sort the formatted lines and revert the leading parents
into indentations
Args:
lines (List[str]): list of formatted lines
Prints:
The sorted and reverted lines
"""
sorted_lines = sorted(lines)
for line in sorted_lines:
if SEPARATOR not in line:
print(line)
else:
leading, _, orig_line = line.rpartition(SEPARATOR)
print(INDENT * (leading.count(SEPARATOR) + 1) + orig_line)
def main():
"""Main entry"""
if len(sys.argv) < 2:
print(f"Usage: {sys.argv[0]} <file>")
sys.exit(1)
formatted = format_text(sys.argv[1])
sort_and_revert(formatted)
if __name__ == "__main__":
main()
Let's save it as format.py, and we have a test file, say test.txt:
parent2
child2-1
child2-1-1
child2-2
parent1
child1-2
child1-2-2
child1-2-1
child1-1
Let's test it:
$ python format.py test.txt
parent1
child1-1
child1-2
child1-2-1
child1-2-2
parent2
child2-1
child2-1-1
child2-2
If you wonder how the format_text function formats the text, here is the intermediate results, which also explains why we could make file sorted as we wanted:
parent2
parent2#child2-1
parent2#child2-1#child2-1-1
parent2#child2-2
parent1
parent1#child1-2
parent1#child1-2#child1-2-2
parent1#child1-2#child1-2-1
parent1#child1-1
You may see that each child has its parents attached, all the way along to the root. So that the children under the same parent are sorted together.
Short answer (Linux solution):
sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
Test it out:
test.txt
parent2
child2-1
child2-1-1
child2-2
parent1
child1-1
child1-2
child1-2-1
$ sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
parent1
child1-1
child1-2
child1-2-1
parent2
child2-1
child2-1-1
child2-2
Explanation:
The idea is to replace the newline followed by an indentation/space with a non newline character, which has to be unique in your file (here I used # for example, if it is not unique in your file, use other characters or even a string), because we need to turn it back the newline and indentation/space later.
About sed command:
:a create a label 'a'
N append the next line to the pattern space
$! if not the last line, ba branch (go to) label 'a'
s substitute, /\n / regex for newline followed by a space
/#/ a unique character to replace the newline and space
if it is not unique in your file, use other characters or even a string
/g global match (as many times as it can)

Handling Multiline Cells in Ruby CSV

require 'csv'
input = CSV.read("test_first.csv", :encoding => 'ascii')[1 .. -1]
DOC = "test_final.csv"
profile = []
profile[0] = "Multiline"
profile[1] = "Standard"
CSV.open(DOC, mode = 'w', :force_quotes => true) do |me|
me << profile
end
a = 0
b = input.length
while a < b
temp = []
temp = input[a]
profile = []
profile[0] = ' A text string with embedded newlines
as well as some substitution from the source file: '"#{temp[0]}"''
profile[1] = temp[1]
CSV.open(DOC, mode = "a", :force_quotes => true ) do |me|
me << profile
end
a += 1
end
The resulting file test_final.csv retains the desired newlines within the cells however the force_quotes isn't working as expected and the embedded newlines aren't being escaped. So when the test_final.csv is reviewed in a text editor it has more lines than intended because each newline is being treated as a new row.
I tried appending the offending column with it's own unique row separator> profile[0] = ' A text string with embedded newlines
as well as some substitution from the source file: '"#{temp[0]}"'~' and assigning that in the options hash like so> :row_sep => "~" but this didn't seem to work.
Some clarification as to the desired input and output
Desired Input:
test_first.csv
1.testHeader1,testheader2
2.sub1,standard1
3.sub2,standard2
Desired Output:
test_final.csv
1.Multiline,Standard
2. A text string with embedded newlines
as well as some substitution from the source file: sub1,standard1
3. A text string with embedded newlines
as well as some substitution from the source file: sub2,standard2
What I'm getting right now instead is:
test_fail.csv
1.Multiline,Standard
2. A text string with embedded newlines
3. as well as some substitution from the source file: sub1,standard1
4. A text string with embedded newlines
5. as well as some substitution from the source file: sub2,standard2

Resources