Markdown within yaml / yaml multi-line escape sequence? - yaml

Is it possible to store unescaped markdown documents in yaml? I've tested
key:|+
markdown text block that could have any combination of line breaks, >, -, :, ', " etc etc.
This does not work. I need something like CDATA or Python-style triple quotes for YAML. Any ideas?

In the literal style of scalar (what you used in the example), line breaks still need to be "escaped" (in this case, indented correctly).
And you can only have printable characters.
I am not familiar with Markdown, but if you needed to store unprintable characters, you would definitely have to escape them.
From Yaml specification:
To ensure readability, YAML streams
use only the printable subset of the
Unicode character set. The allowed
character range explicitly excludes
the C0 control block #x0-#x1F (except
for TAB #x9, LF #xA, and CR #xD which
are allowed), DEL #x7F, the C1 control
block #x80-#x9F (except for NEL #x85
which is allowed), the surrogate
block #xD800-#xDFFF, #xFFFE, and #xFFFF.
On input, a YAML processor must accept
all Unicode characters except those
explicitly excluded above.
On output, a YAML processor must only
produce acceptable characters. Any
excluded characters must be presented
using escape sequences.
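As a minimal sketch of the indentation point above, assuming Ruby's bundled `yaml` library: the Markdown goes into a literal block scalar, with a space after the colon and every content line indented deeper than the key. The document text, the constant `YAML_SOURCE`, and the helper `markdown_from_yaml` are made up for illustration.

```ruby
require 'yaml'

# Hypothetical YAML document: the Markdown sits in a literal block scalar (|+).
# Note the space after "key:" and the extra indentation of the block's lines.
YAML_SOURCE = <<~'DOC'
  key: |+
    # Heading

    > a quote with: colons, 'single' and "double" quotes
    - a list item
DOC

# Returns the Markdown text stored under "key" (illustrative helper).
def markdown_from_yaml(source)
  YAML.safe_load(source).fetch("key")
end

puts markdown_from_yaml(YAML_SOURCE)
```

Inside the block, characters such as #, >, -, :, ' and " are plain content, so the Markdown itself needs no escaping as long as it contains only printable characters.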

Related

Ruby on rails, how to remove white-space from japanese word?

I am trying to remove white-space from japanese word.
input "かいしゃ（会社）"
output "かいしゃ(会社)"
The space here is consumed by the parentheses. They are not your regular ASCII parentheses, they are of the "full width" flavor.
If you want to replace them with ASCII parentheses, you can do it like this:
compact_input = input.gsub("\uFF08", '(').gsub("\uFF09", ')')
Although this might make your string look weird in Japanese (I don't know the language well enough, so can't say).
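A small sketch of two ways to do the replacement in one step, assuming Ruby 2.2 or later for `String#unicode_normalize`. The helper names `ascii_parens` and `nfkc` are made up for illustration. Note that NFKC normalization is broader than the parenthesis swap: it also rewrites other full-width and half-width characters, which may or may not be what you want.

```ruby
# Illustrative helper: translate only the two full-width parentheses
# (U+FF08 and U+FF09) to their ASCII counterparts.
def ascii_parens(text)
  text.tr("\uFF08\uFF09", "()")
end

# Illustrative helper: NFKC normalization folds full-width forms,
# including these parentheses, into their compatibility equivalents.
def nfkc(text)
  text.unicode_normalize(:nfkc)
end

# Escapes are used here so the full-width characters are unmistakable.
input = "かいしゃ\uFF08会社\uFF09"

puts ascii_parens(input)
puts nfkc(input)
```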

Creation of file format in Snowflake

I am new to Snowflake. I have run into a problem while loading some data. The delimiter is part of extended ASCII; it does not fall in the 0-127 range. We use thorn (ASCII 254) as the delimiter. My question: when specifying the delimiter, can I give the ASCII code of the delimiter instead of the actual character (44 instead of comma, 9 instead of tab, etc.)?
Thanks in advance.
You can specify the hex/octal code of any valid Unicode delimiter in the FIELD_DELIMITER option of the File Format. From the documentation:
The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
For example, for fields delimited by the thorn (Þ) character, specify the octal (\336) or hex (0xDE) value. Also accepts a value of NONE.
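To double-check which numeric value belongs to which thorn, here is a small Ruby sketch (the helper name `delimiter_codes` is made up for illustration). It prints the decimal, hex, and octal code point and the UTF-8 bytes for uppercase Þ (U+00DE) and lowercase þ (U+00FE). The 254 mentioned in the question corresponds to the lowercase form, while the 0xDE / \336 in the documentation excerpt is the uppercase one, so it is worth confirming which character the data file actually contains.

```ruby
# Illustrative helper: report a character's code point in several notations,
# plus the bytes it occupies when encoded as UTF-8.
def delimiter_codes(char)
  code_point = char.ord
  {
    decimal: code_point,
    hex: format("0x%02X", code_point),
    octal: format("\\%o", code_point),
    utf8_bytes: char.encode("UTF-8").bytes.map { |b| format("0x%02X", b) }
  }
end

# Uppercase thorn, then lowercase thorn.
["\u00DE", "\u00FE"].each do |char|
  puts "#{char}: #{delimiter_codes(char)}"
end
```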

Is using ASCII 10 inside a HL7 segment a valid way to represent a new line?

Placing an ASCII 10 (0A) character somewhere inside of a segment of an HL7 message to represent a new line character. Is this valid?
From what I can see, it is recommended to use \X0D\ or \X0D0A\ to represent a new line character for plain text format HL7. Is using just the 0A ASCII character explicitly invalid HL7?
To respond to the question "Is using just the 0A ASCII character explicitly invalid HL7?":
The character 0A is not mentioned anywhere in the HL7 specs as being special.
Extract from the HL7 2.5 US specs:
2.5.4 Message delimiters
In constructing a message, certain special characters are used. They are the segment terminator, the field
separator, the component separator, subcomponent separator, repetition separator, and escape character. The
segment terminator is always a carriage return (in ASCII, a hex 0D). The other delimiters are defined in the
MSH segment, with the field delimiter in the 4th character position, and the other delimiters occurring as in
the field called Encoding Characters, which is the first field after the segment ID. The delimiter values used
in the MSH segment are the delimiter values used throughout the entire message. In the absence of other
considerations, HL7 recommends the suggested values found in Figure 2-1 delimiter values.
Strictly speaking this would mean that you could use the character 0A just as any of the characters other than the 6 previously mentioned.
<end of "formal" reply>
That being said, I concur with Dale H. that you should better stay away from using this character in the content of an HL7 message. Since most editors (except old-fashioned Notepad on Windows) will display this character as a new line, you might unwillingly think that a segment was truncated or malformed. And I've had at least one instance where the interface engine indeed handled that character as a segment termination (which in itself is invalid, and the interface engine build was modified to not do this anymore).
So better avoid this. But in situations where you don't control the output, it doesn't seem to be a formally disallowed character...
Linefeeds (0x0A) are not allowed in HL7 messages. If you edit messages with notepad, wordpad and many other text editors, they will convert carriage returns (0x0D) to CR/LF (0x0D 0x0A) and if you save, you now have a corrupt HL7 message. Avoid LFs (0x0A).
If you only send 0A then there is no way to determine that you wanted ASCII 10/line feed and it would be assumed you wanted a zero and an A.
With standard HL7, where the escape character is \, yes, the recommended way would be \X0A\. The \X marks the start of hexadecimal data, followed by two-character hexadecimal values, and the sequence ends with a \.
That being said, if you are sending this data to a system, then they should be able to tell you what they accept for line feeds. I've seen systems that use \.br\ or the repetition character ~ to determine a new line. And sometimes they want repeating segments. In the example below, each OBX segment is a new line of a report in the system.
OBX|1|TX|||This is line one
OBX|2|TX|||This is line two
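The \X..\ escaping described above can be sketched in Ruby as follows. This is only an illustration: the helper name `hl7_escape_newlines` is made up, it assumes the default escape character \, and it does not escape the other HL7 delimiters. Whether the receiving system accepts this form still has to be confirmed with them.

```ruby
# Illustrative helper: replace raw CR, LF, or CRLF inside a field value with
# an HL7 hexadecimal escape sequence such as \X0A\ or \X0D0A\.
def hl7_escape_newlines(text, escape_char = "\\")
  # CRLF is listed first so it becomes one sequence instead of two.
  text.gsub(/\r\n|\r|\n/) do |line_break|
    hex = line_break.bytes.map { |b| format("%02X", b) }.join
    "#{escape_char}X#{hex}#{escape_char}"
  end
end

puts hl7_escape_newlines("This is line one\nThis is line two")
puts hl7_escape_newlines("This is line one\r\nThis is line two")
```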

Allowed characters in map key identifier in YAML?

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?
The YAML 1.2 specification simply advises using printable characters, with control characters explicitly excluded (see here):
In constructing key names, characters the YAML spec. uses to denote syntax or special meaning need to be avoided (e.g. # denotes comment, > denotes folding, - denotes list, etc.).
Essentially, you are left to the relative coding conventions (restrictions) by whatever code (parser/tool implementation) that needs to consume your YAML document. The more you stick with alphanumerics the better; it has simply been our experience that the underscore has worked with most tooling we have encountered.
It has been a shared practice with others we work with to convert the period character . to an underscore character _ when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen it misconstrued in some implementations.
Any character (if properly quoted by either single quotes 'example' or double quotes "example"). Please be aware that the key does not have to be a scalar ('example'). It can be a list or a map.
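A short sketch of both points, assuming Ruby's bundled `yaml` library: quoted keys may contain characters that otherwise carry syntactic meaning, and a key can itself be a sequence when written with the explicit `?` key indicator. The document text and the constant `KEYS_SOURCE` are made up for illustration; other parsers may be stricter or map non-scalar keys differently.

```ruby
require 'yaml'

# Hypothetical document: a plain key, two quoted keys containing characters
# that normally have syntactic meaning, and a sequence used as a key.
KEYS_SOURCE = <<~'DOC'
  plain_key: 1
  "key: with # specials": 2
  'key with > and -': 3
  ? [a, b]
  : 4
DOC

data = YAML.safe_load(KEYS_SOURCE)
data.each { |key, value| puts "#{key.inspect} => #{value}" }
```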

Getting a meaning of the string

I have the following string "\u3048\u3075\u3057\u3093". I got the string
from a web page as part of returned data in JSONP.
What is that? It looks like UTF-8, but then shouldn't it look like "U+3048U+3075U+3057U+3093"?
What's the meaning of the backslashes (\)?
How can I convert it to a human-readable form?
I'm looking for a solution with Ruby, but any explanation of what's going on here is appreciated.
The U+3048 syntax is normally used to represent the Unicode code point of a character. Such code point is fixed and does not depend on the encoding (UTF-8, UTF-32...).
A JSON string is composed of Unicode characters except double quote, backslash and those in the U+0000 to U+001F range (control characters). Characters can be represented with an escape sequence starting with \u and followed by 4 hexadecimal digits that represent the Unicode code point of the character. This is the JavaScript syntax (JSON is a subset of it). In JavaScript, the backslash is used as the escape character.
It is Unicode, but not UTF-8: each \uXXXX escape is a UTF-16 code unit. If you ignore surrogate pairs, you can treat each escape as the 4-digit hexadecimal code point of a Unicode character.
Using Ruby 1.9:
require 'json'
puts JSON.parse("[\"\\u4e00\",\"\\u4e8c\"]")
Prints:
一
二
Unicode characters in JSON are escaped as backslash u followed by four hex digits. See the string production on json.org.
Any JSON parser will convert it to the correct representation for your platform (if it doesn't, then by definition it is not a JSON parser)
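Applied to the exact string from the question, a minimal Ruby sketch looks like this. The helper name `decode_json_string` is made up for illustration, and the value is wrapped in an array, as in the snippet above, so it also parses with JSON library versions that reject a bare top-level string.

```ruby
require 'json'

# Illustrative helper: decode one JSON string literal (including its
# surrounding double quotes) into a Ruby string.
def decode_json_string(json_literal)
  JSON.parse("[#{json_literal}]").first
end

# Single quotes keep the backslashes literal, exactly as received in the JSONP data.
raw = '"\u3048\u3075\u3057\u3093"'

decoded = decode_json_string(raw)
puts decoded
puts decoded.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```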
