Reading and writing back yaml files with multi-line strings - yaml

I have to read a yaml file, modify it and write back using pyYAML. Every thing works fine except when there is multi-line string values in single quotes e.g. if input yaml file looks like
FOO:
- Bar: '{"HELLO":
"WORLD"}'
then reading it as data=yaml.load(open("foo.yaml")) and writing it yaml.dump(data, fref, default_flow_style=False) generates something like
FOO:
- Bar: '{"HELLO": "WORLD"}'
i.e. without the extra line for Bar value. Strange thing is that if input file has something like
FOO:
- Bar: '{"HELLO":
"WORLD"}'
i.e. one extra new line for Bar value then writing it back generates the correct number of new lines. Any idea what I am doing wrong?

You are not doing anything wrong, but you probably should have read more of the YAML specification.
According to the (outdated) 1.1 spec that PyYAML implements, within
single quoted scalars:
In a multi-line single-quoted scalar, line breaks are subject to (flow) line folding, and any trailing white space is excluded from the content.
And line-folding:
Line folding allows long lines to be broken for readability, while retaining the original semantics of a single long line. When folding is done, any line break ending an empty line is preserved. In addition, any specific line breaks are also preserved, even when ending a non-empty line.
This means that your first two examples are the same, as the
line-break is read as if there is a space.
The third example is different, because it actually contains a newline after loading, because "any line break ending an empty line is preserved".
In order to understand why that dumps back as it was loaded, you have to know that PyYAML doesn't
maintain any information about the quoting (nor about the single newline in the first example), it
just loads that scalar into a Python string. During dumping PyYAML evaluates how that string
can best be written and the options it considers (unless you try to force things using the default_style argument to dump()): plain style, single quoted style, double quoted style.
PyYAML will use plain style (without quotes) when possible, but since
the string starts with {, this leads to confusion (collision) with
that character's use as the start of a flow style mapping. So quoting
is necessary. Since there are also double quotes in the string, and
there are no characters that need backslash escaping the "cleanest"
representation that PyYAML can choose is single quoted style, and in
that style it needs to represent a line-break by including an emtpy
line withing the single quoted scalar.
I would personally prefer using a block style literal scalar to represent your last example:
FOO:
- Bar: |
{"HELLO":
"WORLD"}
but if you load, then dump that using PyYAML its readability would be lost.
Although worded differently in the YAML 1.2 specification (released almost 10 years ago) the line-folding works the same, so this would "work" in a similar way with a more up-to-date YAML loader/dumper. My package ruamel.yaml, for loading/dumping YAML 1.2 will properly maintain the block style if you set the attribute preserve_quotes = True on the YAML() instance, but it will still get rid of the newline in your first example. This could be implemented (as is shown by ruamel.yaml preserving appropriate newline positions in folded style block scalars), but nobody ever asked for that, probably because if people want that kind of control over wrapping they use a block style to start with.

Related

Sphinx issues mysterious error in literal blocks

In Sphinx (the ReStructuredText publishing system), are there any obscure rules that limit what a literal block can contain?
Background: My document contains many literal blocks that follow a double-colon paragraph, like this:
Background:... follow a double-colon paragraph, like this::
$ sudo su
# echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc
This block (with a different preceding paragraph) is one of the ones that issues an error: "WARNING: Inconsistent literal block quoting." The message indicates that the error is in the "echo" line. In the HTML output the literal block contains only the "sudo" line; the "echo" line is treated as ordinary text.
I haven't been able to identify any common property in the lines that report errors, or anything that distinguishes them, as a class, from lines in other literal blocks that don't get errors.
I stripped down the project to isolate the problem, and I identified it that way.
I had a numbered list item that contained a double-colon literal block that was indented only as far as the list item's text, like this:
2. Set up the... directory::
$ A Linux command
$ Another Linux command
$ And ANOTHER Linux command
$ etc.
When I indented the literal block further, the problem went away.
I was misled by two things:
The message does not point to the first line in the literal block, but to some apparently random line within it. In the case above, it pointed to the fifth line (out of eight) in the block!
In most cases this form of indention, although incorrect, works just fine.
Isolating the problem is a brute-force method of solving it, but is often effective when deduction fails. I'll keep that in mind in the future.

How can I refactor an existing source code file to normalize all use of tab?

Sometimes I find myself editing a C source file which sees both use of tab as four spaces, and regular tab.
Is there any tool that attempts to parse the file and "normalize" this, i.e. convert all occurrences of four spaces to regular tab, or all occurrences of tab to four spaces, to keep it consistent?
I assume something like this can be done even with just a simple vim one-liner?
There's :retab and :retab! which can help, but there are caveats.
It's easier if you're using spaces for indentation, then just set 'expandtab' and execute :retab, then all your tabs will be converted to spaces at the appropriate tab stops (which default to 8.) That's easy and there are no traps in this method!
If you want to use 4 space indentation, then keep 'expandtab' enabled and set 'softtabstop' to 4. (Avoid modifying the 'tabstop' option, it should always stay at 8.)
If you want to do the inverse and convert to tabs instead, you could set 'noexpandtab' and then use :retab! (which will also look at sequences of spaces and try to convert them back to tabs.) The main problem with this approach is that it won't just consider indentation for conversion, but also sequences of spaces in the middle of lines, which can cause the operation to affect strings inside your code, which would be highly undesirable.
Perhaps a better approach for replacing spaces with tabs for indentation is to use the following substitute command:
:%s#^\s\+#\=repeat("\t", indent('.') / &tabstop).repeat(" ", indent('.') % &tabstop)#
Yeah it's a mouthful... It's matching whitespace at the beginning of the lines, then using the indent() function to find the total indentation (that function calculates indentation taking tab stops in consideration), then dividing that by the 'tabstop' to decide how many tabs and how many spaces a specific line needs.
If this command works for you, you might want to consider adding a mapping or :command for it, to keep it handy. For example:
command! -range=% Retab <line1>,<line2>s#^\s\+#\=repeat("\t", indent('.') / &tabstop).repeat(" ", indent('.') % &tabstop)
This also allows you to "Retab" a range of the file, including one you select with a visual selection.
Finally, one last alternative to :retab is that to ask Vim to "reformat" your code completely, using the = command, which will use the current 'indentexpr' or other indentation configurations such as 'cindent' to completely reindent the block. That typically respects your 'noexpandtab' and 'smarttabstop' options, so it use tabs and spaces for indentation consistently. The downside of this approach is that it will completely reformat your code, including changing indentation in places. The upside is that it typically has a semantic understanding of the language and will be able to take that in consideration when reindenting the code block.

ruamel condensing comments and injecting 0x07

Given the following code:
from ruamel.yaml import YAML
yaml = YAML()
with open(filename) as f:
z = yaml.load(f)
yaml.dump(z, sys.stdout)
And the following file:
a: >
Hello.<b>
World.
When <b> is a space character (0x20), produces the following YAML:
a: >
Hello. <0x07> World.
When <0x07> is the byte 0x07.
Trying to re-load this YAML using PyYAML results in an error as 0x07 is an invalid character.
This does not happen when I remove the trailing blank after the Hello. in the input YAML.
Any idea what can cause this?
The BEL character (0x07, \a) is inserted during parsing in block style folded strings, so that the representation for that scalar in Python (ruamel.yaml.scalarstring.FoldedScalarString) can register the positions where the original folds did occur. At dump time, the reverse is done: the positions are translated to BEL characters (if they correspond to spaces) and so transmit these folding positions from the representer to the emitter, which then outputs the scalar with the "folds" at the original points the occurred. This of course can/should only happen if the positions still represent "foldable" positions.
The problem here is that the parser should, during loading, complain that your YAML is incorrect. It fails to do so, loads faulty data and then fails to properly dump the mess it allowed to be loaded in the first place, resulting in the BEL character ending up in the output.
The YAML specification states:
Folding allows long lines to be broken anywhere a single space character separates two non-space characters.
And as your line has not been folded between two non-space characters, this should result in a warning, if not in an immediate parser error.¹
Additionally the representer should of course be smart enough not to replace a space by a BEL character if the space it is replacing is adjacent to white-space. That situation can also occur after changing a string that was loaded from correct YAML with a folded string. I essentially consider that a bug.
The ruamel.yaml>0.15.80 has a fix for the incorrect representation. An implementation on the error/warning on loading is likely to follow soon.
¹ When only issuing a warning, my initial reaction is that I should strip the faulty trailing space, or spaces in case there are more, because it is invisible, and keeping the fold.

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
☂☃☽☀︎☔︎
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed ☔︎ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font can distinguish the two styles, the following might be the emoji version: ☔︉ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[…\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

Freemarker Interpolation stripping whitespace?

I seem to be having issues with leading/trailing spaces in textareas!
If the last user has typed values into a textarea with leading/trailing spaces across multiple lines, they all disappear with exception to one space in the beginning & end.
Example:
If the textbox had the following lines: (quotes present only to help illustrate spaces)
" 3.0"
" 2.2 "
"0.3 "
it would be saved in the backend as
"<textarea id=... > 3.0/n 2.2 /n0.3 </textarea>"
My template (for this part) is fairly straightforward (entire template, not as easy...): ${label} ${textField}
When I load up the values again, I notice getTextField() is properly getting the desired string, quoted earlier... But when I look at the html page it's showing
" 3.0"
"2.2"
"0.3 "
And of course when "View Sourcing" it doesn't have the string seen in getTextField()
What I've tried:
Ensure the backend has setWhitespaceStripping(false); set
Adding the <#ftl strip_whitespace=false>
Adding the <#nl> on the same line as ${textField}
No matter what I've tried, I'm not having luck keeping the spaces after the interpolation.
Any help would be very appreciated!
Maybe you are inside a <#compress>...</#compress> (or <#compress>...</#compress>) block. Those filter the whole output on runtime and reduce whitespace regardless where it comes from. I recommend not using this directive. It makes the output somewhat smaller, but it has runtime overhead, and can corrupt output in cases like this.
FreeMarker interpolations don't remove whitespace from the inserted value, or change the value in any way. Except, if you are lexically inside an <#escape ...>....</#escape>, block, that will be automatically applied. But it's unlikely that you have an escaping expression that corrupts whitespace. But to be sure., you can check if there's any <#escape ...> in the same template file (no need to check elsewhere, as it's not a runtime directive).
strip_whitespace and #nt are only removing white-space during parsing (that's before execution), so they are unrelated.
You can also check if the whitespace is still there in the inserted value before inserting like this:
${textField?replace(" ", "[S]")?replace("\n", "[N]")?replace("\t", "[T]")}
If you find that they were already removed that probably means that they were already removed before the value was put into the data-model. So then if wasn't FreeMarker.

Resources