ruamel condensing comments and injecting 0x07 - yaml

Given the following code:
import sys
from ruamel.yaml import YAML

yaml = YAML()
with open(filename) as f:
    z = yaml.load(f)
yaml.dump(z, sys.stdout)
And the following file:
a: >
  Hello.<b>
  World.
Where <b> is a space character (0x20), this produces the following YAML:
a: >
  Hello. <0x07> World.
Where <0x07> is the byte 0x07.
Trying to re-load this YAML using PyYAML results in an error as 0x07 is an invalid character.
This does not happen when I remove the trailing blank after the Hello. in the input YAML.
Any idea what can cause this?

The BEL character (0x07, \a) is inserted during parsing of block style folded scalars, so that the Python representation for such a scalar (ruamel.yaml.scalarstring.FoldedScalarString) can record the positions where the original folds occurred. At dump time the reverse is done: those positions are translated back to BEL characters (if they correspond to spaces), which transmits the folding positions from the representer to the emitter, and the emitter then outputs the scalar with the "folds" at the original points they occurred. This of course can/should only happen if the positions still represent "foldable" positions.
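You can inspect those recorded positions on the loaded scalar; a minimal sketch, assuming a recent ruamel.yaml where they live on the fold_pos attribute of FoldedScalarString:

import sys
from ruamel.yaml import YAML

yaml = YAML()
data = yaml.load("a: >\n  Hello.\n  World.\n")
print(type(data['a']).__name__)  # FoldedScalarString
print(data['a'].fold_pos)        # offsets of the original folds (assumed attribute name)
yaml.dump(data, sys.stdout)      # the folds are re-inserted at the recorded positions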
The problem here is that the parser should, during loading, complain that your YAML is incorrect. It fails to do so, loads faulty data and then fails to properly dump the mess it allowed to be loaded in the first place, resulting in the BEL character ending up in the output.
The YAML specification states:
Folding allows long lines to be broken anywhere a single space character separates two non-space characters.
And as your line has not been folded between two non-space characters, this should result in a warning, if not in an immediate parser error.¹
Additionally, the representer should of course be smart enough not to replace a space with a BEL character if that space is adjacent to whitespace. That situation can also occur after modifying a string that was loaded from correct YAML containing a folded scalar. I essentially consider that a bug.
ruamel.yaml>0.15.80 has a fix for the incorrect representation. An implementation of the error/warning on loading is likely to follow soon.
¹ When only issuing a warning, my initial reaction is that I should strip the faulty trailing space (or spaces, in case there are more), because it is invisible, while keeping the fold.

Related

Find and replace non-UTF-8 characters

I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non-UTF-8 characters, and then users are unable to search the data. Specifically, apostrophes come over from Microsoft Word and then get interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a space.
There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and convert it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this is that the “apostrophe string” does contain otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings you’re unlikely to be able to separate encoded characters from non-encoded ones, which would leave the resulting text full of errors.
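If the text turns out to be mojibake from mis-decoded UTF-8 (as Word apostrophes often are), a library such as ftfy can detect and reverse the common cases; a minimal sketch, using a hypothetical single-layer mangling rather than the multiply-encoded string from the question:

# pip install ftfy
import ftfy

broken = "JERRY Oâ€™CONNELL"  # hypothetical: UTF-8 bytes of ’ read back as Windows-1252
print(ftfy.fix_text(broken))   # JERRY O’CONNELL

Deeply nested encodings like the one shown in the question may still need manual unwrapping, one layer at a time.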

Reading and writing back yaml files with multi-line strings

I have to read a YAML file, modify it, and write it back using PyYAML. Everything works fine except when there are multi-line string values in single quotes, e.g. if the input YAML file looks like
FOO:
- Bar: '{"HELLO":
    "WORLD"}'
then reading it as data = yaml.load(open("foo.yaml")) and writing it with yaml.dump(data, fref, default_flow_style=False) generates something like
FOO:
- Bar: '{"HELLO": "WORLD"}'
i.e. without the extra line for the Bar value. The strange thing is that if the input file has something like
FOO:
- Bar: '{"HELLO":

    "WORLD"}'
i.e. one extra newline for the Bar value, then writing it back generates the correct number of newlines. Any idea what I am doing wrong?
You are not doing anything wrong, but you probably should have read more of the YAML specification.
According to the (outdated) 1.1 spec that PyYAML implements, within single-quoted scalars:
In a multi-line single-quoted scalar, line breaks are subject to (flow) line folding, and any trailing white space is excluded from the content.
And line-folding:
Line folding allows long lines to be broken for readability, while retaining the original semantics of a single long line. When folding is done, any line break ending an empty line is preserved. In addition, any specific line breaks are also preserved, even when ending a non-empty line.
This means that your first two examples are the same, as the line break is read as if there were a space.
The third example is different: it actually contains a newline after loading, because "any line break ending an empty line is preserved".
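You can verify both behaviours directly; a quick sketch with PyYAML (safe_load used here for brevity):

import yaml

folded = yaml.safe_load('Bar: \'{"HELLO":\n    "WORLD"}\'')
kept = yaml.safe_load('Bar: \'{"HELLO":\n\n    "WORLD"}\'')
print(repr(folded["Bar"]))  # '{"HELLO": "WORLD"}'  -- the line break is folded to a space
print(repr(kept["Bar"]))    # '{"HELLO":\n"WORLD"}' -- the break ending the empty line is preserved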
In order to understand why that dumps back as it was loaded, you have to know that PyYAML doesn't maintain any information about the quoting (nor about the single newline in the first example); it just loads that scalar into a Python string. During dumping PyYAML evaluates how that string can best be written, and the options it considers (unless you try to force things using the default_style argument to dump()) are plain style, single-quoted style, and double-quoted style.
PyYAML will use plain style (without quotes) when possible, but since the string starts with {, this leads to confusion (collision) with that character's use as the start of a flow style mapping, so quoting is necessary. Since there are also double quotes in the string, and there are no characters that need backslash escaping, the "cleanest" representation that PyYAML can choose is single-quoted style, and in that style it needs to represent a line break by including an empty line within the single-quoted scalar.
I would personally prefer using a block style literal scalar to represent your last example:
FOO:
- Bar: |
    {"HELLO":
    "WORLD"}
but if you load, then dump that using PyYAML its readability would be lost.
Although worded differently in the YAML 1.2 specification (released almost 10 years ago), line folding works the same, so this would "work" in a similar way with a more up-to-date YAML loader/dumper. My package ruamel.yaml, for loading/dumping YAML 1.2, will properly maintain the block style if you set the attribute preserve_quotes = True on the YAML() instance, but it will still get rid of the newline in your first example. That could be implemented (as is shown by ruamel.yaml preserving appropriate newline positions in folded style block scalars), but nobody ever asked for it, probably because people who want that kind of control over wrapping use a block style to start with.
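For completeness, a round-trip sketch with ruamel.yaml as described above:

import sys
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
data = yaml.load('FOO:\n- Bar: |\n    {"HELLO":\n    "WORLD"}\n')
yaml.dump(data, sys.stdout)  # the literal block style survives the round-trip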

Is using ASCII 10 inside a HL7 segment a valid way to represent a new line?

Is placing an ASCII 10 (0x0A) character somewhere inside a segment of an HL7 message a valid way to represent a new line?
From what I can see it is recommended to use \X0D\ or \X0D0A\ to represent a new line character in plain-text format HL7. Is using just the 0A ASCII character explicitly invalid HL7?
To respond to the question "Is using just the 0A ASCII character explicitly invalid HL7?":
The character 0A is not mentioned anywhere in the HL7 specs as being special.
Extract from the HL7 2.5 US specs:
2.5.4 Message delimiters
In constructing a message, certain special characters are used. They are the segment terminator, the field separator, the component separator, subcomponent separator, repetition separator, and escape character. The segment terminator is always a carriage return (in ASCII, a hex 0D). The other delimiters are defined in the MSH segment, with the field delimiter in the 4th character position, and the other delimiters occurring as in the field called Encoding Characters, which is the first field after the segment ID. The delimiter values used in the MSH segment are the delimiter values used throughout the entire message. In the absence of other considerations, HL7 recommends the suggested values found in Figure 2-1 delimiter values.
Strictly speaking this would mean that you could use the character 0A just like any character other than the six previously mentioned.
<end of "formal" reply>
That being said, I concur with Dale H. that you had better stay away from using this character in the content of an HL7 message. Since most editors (except old-fashioned Notepad on Windows) will display this character as a new line, you might mistakenly think that a segment was truncated or malformed. And I've had at least one instance where the interface engine indeed handled that character as a segment terminator (which in itself is invalid, and the interface engine build was modified to not do this anymore).
So better avoid this. But in situations where you don't control the output, it doesn't seem to be a formally disallowed character...
Linefeeds (0x0A) are not allowed in HL7 messages. If you edit messages with notepad, wordpad and many other text editors, they will convert carriage returns (0x0D) to CR/LF (0x0D 0x0A) and if you save, you now have a corrupt HL7 message. Avoid LFs (0x0A).
If you only send the characters 0A, there is no way to determine that you wanted ASCII 10 (line feed); it would be assumed you wanted a zero and an A.
In standard HL7, with the escape character being \, the recommended way would indeed be \X0A\: the \X represents the start of hexadecimal data, followed by two-character hexadecimal values, ending with a \.
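A small sketch of that escaping (a hypothetical helper, not part of any HL7 library, assuming the default escape character \):

def escape_hl7_linebreaks(field):
    # Hypothetical helper: replace raw line breaks in field content
    # with \X...\ hexadecimal escapes.
    return (field.replace("\r\n", "\\X0D0A\\")
                 .replace("\r", "\\X0D\\")
                 .replace("\n", "\\X0A\\"))

print(escape_hl7_linebreaks("line one\nline two"))
# line one\X0A\line two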
That being said, if you are sending this data to a system, then they should be able to tell you what they accept for line feeds. I've seen systems that use \.br\ or the repetition character ~ to denote a new line. And sometimes they want repeating segments; for example below, each OBX segment is a new line of a report in the system.
OBX|1|TX|||This is line one
OBX|2|TX|||This is line two
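A sketch of building such repeating segments from a multi-line report (hypothetical field layout, modeled on the example above):

report = "This is line one\nThis is line two"
segments = [f"OBX|{i}|TX|||{line}"
            for i, line in enumerate(report.splitlines(), 1)]
print("\r".join(segments))  # segments joined with the 0x0D segment terminator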

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the character's code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {}           # hash to store mappings in
s.split("").each do |c|  # loop through each "character"
  begin
    c.encode("ASCII")    # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError  # but if that fails
    encodings[c] = $!.error_char.dump        # capture a dump, mapped to the source character
  end
end

# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
 ̆ encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋, U+1F64B) with a modifier (🏾, U+1F3FE). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306, which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
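A quick way to check whether two visually identical strings differ in this way is canonical normalization; a sketch using Python's unicodedata module (Ruby ≥ 2.2 offers String#unicode_normalize(:nfc) for the same purpose):

import unicodedata

precombined = "\u045e"      # ў as a single pre-combined code point
combined = "\u0443\u0306"   # у followed by the combining breve
print(precombined == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precombined)  # True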
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
You can directly use Unicode characters if you just add # encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú, and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
The gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

Ruby CGI.unescapeHTML generating strange charactors

I have backed up a bunch of Markdown-formatted comments into an XML document. This of course meant I needed to HTML-escape them. When I try to use CGI.unescapeHTML, it adds a bunch of strange characters into the markup that do not render well in all browsers.
Specifically, it replaces two spaces with "\302\240 ", but not consistently. How do I get it to stop this behavior?
eg:
s = "I am seeing more and more <a href="http://github.com/aslakhellesoy/cucumber /tree/master">Cucumber</a> usage.  This is a good thing!  But I'm also seeing people who are not using regular expressions to their fullest.  Here are some quick regex tips to keep you features readable:
* `(?:a|an)` -- using a this construct you can group things wihout actually matching them.  I'm seeing a lot of steps that have unused params because someone needed a group but didn't know how to avoid capturing it&#x000A"
CGI.unescapeHTML s
# => "I am seeing more and more Cucumber usage.\302\240 This is a good thing!\302\240 But I'm..."
Those are non-breaking spaces; "\302\240" is Ruby's octal escape for the bytes 0xC2 0xA0, the UTF-8 encoding of U+00A0. Read up on Wikipedia:
In computer-based text processing and digital typesetting, a non-breaking space, also known as a no-break space or non-breakable space (NBSP), is a variant of the space character that prevents an automatic line break (line wrap) at its position. In certain formats (such as HTML), it also prevents the “collapsing” of multiple consecutive whitespace characters into a single space. The non-breaking space is also known as a hard space or fixed space. In Unicode, it is encoded as U+00A0 NO-BREAK SPACE (HTML: &nbsp;).
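If the non-breaking spaces are unwanted, the usual fix is to replace U+00A0 with a regular space after unescaping (in modern Ruby, s.gsub("\u00A0", " ")); the same behaviour, sketched here with Python's html module:

import html

s = html.unescape("usage.&nbsp; This is a good thing!")
print(repr(s))                         # 'usage.\xa0 This is a good thing!'
print(repr(s.replace("\u00a0", " ")))  # NBSPs replaced by plain spaces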
