Creation of file-format in Snow-flake - ascii

I am new to SF. I have a typical problem that I have faced while loading some data. The delimiter is part of extended ascii. It does not come in 0-127. We use thorn (ascii - 254) as delimiter. My Qn is while specifying the delimiter can I give the ascii code of that delimiter instead of actual character (44 instead of comma, 9 instead of tab etc)
Thanks in advance

You can specify the hex/octal code of any valid Unicode delimiter in the FIELD_DELIMITER option of the File Format. From the documentation:
The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
For example, for fields delimited by the thorn (Þ) character, specify the octal (\336) or hex (0xDE) value. Also accepts a value of NONE.

Related

Is using ASCII 10 inside a HL7 segment a valid way to represent a new line?

Placing an ASCII 10 (0A) character somewhere inside of a segment of an HL7 message to represent a new line character. Is this valid?
From what I can see it is recommend to use \X0D\ or \X0D0A\ to represent a new line character for plain text format HL7. Is using just the 0A ASCII character explicitly invalid HL7?
To respond to the question "Is using just the 0A ASCII character explicitly invalid HL7?":
The character 0A is not mentioned anywhere in the HL7 specs as being special.
Extract from the HL7 2.5 US specs:
2.5.4 Message delimiters
In constructing a message, certain special characters are used. They are the segment terminator, the field
separator, the component separator, subcomponent separator, repetition separator, and escape character. The
segment terminator is always a carriage return (in ASCII, a hex 0D). The other delimiters are defined in the
MSH segment, with the field delimiter in the 4th character position, and the other delimiters occurring as in
the field called Encoding Characters, which is the first field after the segment ID. The delimiter values used
in the MSH segment are the delimiter values used throughout the entire message. In the absence of other
considerations, HL7 recommends the suggested values found in Figure 2-1 delimiter values.
Strictly speaking this would mean that you could use the character 0A just as any of the characters other than the 6 previously mentioned.
<end of "formal" reply>
That being said, I concur with Dale H. that you should better stay away from using this character in the content of an HL7 message. Since most editors (except old-fashioned Notepad on Windows) will display this character as a new line, you might unwillingly think that a segment was truncated or malformed. And I've had at least one instance where the interface engine indeed handled that character as a segment termination (which in itself is invalid, and the interface engine build was modified to not do this anymore).
So better avoid this. But in situations where you don't control the output, it doesn't seem to be a formally disallowed character...
Linefeeds (0x0A) are not allowed in HL7 messages. If you edit messages with notepad, wordpad and many other text editors, they will convert carriage returns (0x0D) to CR/LF (0x0D 0x0A) and if you save, you now have a corrupt HL7 message. Avoid LFs (0x0A).
If you only send 0A then there is no way to determine that you wanted ASCII 10/line feed and it would be assumed you wanted a zero and an A.
Standard HL7 with the escape character being a \, then yes the recommended way would be \X0A\. The \X representing the start of hexadecimal data, followed by two-character hexadecimal values, ending with a \.
That being said, if you are sending this data to a system then they should be able to tell you what they accept for lines feeds. I've seen systems that use \.br\ or the repetition character ~ to determine a new line. And sometimes they want repeating segments. For example below, each OBX segment is a new line of a report in the system.
OBX|1|TX|||This is line one
OBX|2|TX|||This is line two

How can I add hair space ( ) to gsub string?

I have the following line in a plugin to display page views on my Jekyll site:
html = pv.to_s.reverse.gsub(/...(?=.)/,'\& ').reverse
It adds space between thousands, for example 23 678.
How can I add hair space   instead of regular space in this string?
In HTML   is a so-called decimal numeric character reference:
The ampersand must be followed by a "#" (U+0023) character, followed by one or more ASCII digits, representing a base-ten integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a ";" (U+003B) character.
Ruby has the \u escape sequence. However it expects the following characters to represent a hexadecimal (base-sixteen) integer. That's 200A. You also have to use a double-quoted string literal which means now the \ character needs to be escaped with another one:
"\\&\u200A"
Alternatively just use it directly:
'\& '

MacVim Replace All Issue

I have an html file that I need to replace some characters with html entities. Right now I'm trying to replace — with — but when I use the Replace All button, the result is that all of those instances of — are replaced with —mdash;
I thought maybe escaping the "&" will work, so I changed the Replace with value to \— but that just results in \—mdash;
The strange thing is that if I go to each, one by one, i.e., click Next, then click Replace, and so on, then it replaces it correctly.
Is this a bug in MacVim? Or am I missing something?
Enter into command line:
:%s/—/\—/g
Also it's possible to get character code. Place your cursor on the character and press ga. Use decimal, hex or octal code into replacement string:
\%d match specified decimal character
\%x match specified hex character
\%o match specified octal character
\%u match specified multibyte character
\%U match specified large multibyte character
:%s/\%d8212/\$mdash;/g

Getting a meaning of the string

I have the following string "\u3048\u3075\u3057\u3093". I got the string
from a web page as part of returned data in JSONP.
What is that? It looks like UTF8, but then should it look like "U+3048U+3075U+3057U+3093"?
What's the meaning of the backslashes (\)?
How can I convert it to a human-readable form?
I'm looking to a solution with Ruby, but any explanation of what's going on here is appreciated.
The U+3048 syntax is normally used to represent the Unicode code point of a character. Such code point is fixed and does not depend on the encoding (UTF-8, UTF-32...).
A JSON string is composed of Unicode characters except double quote, backslash and those in the U+0000 to U+001F range (control characters). Characters can be represented with a escape sequence starting with \u and followed by 4 hexadecimal digits that represent the Unicode code point of the character. This is the JavaScript syntax (JSON is a subset of it). In JavaScript, the backslash is used as escape char.
It is Unicode, but not in UTF-8, it is in UTF-16. You might ignore surrogate pairs and deem it as 4-digit hexadecimal code points of a Unicode code character.
Using Ruby 1.9:
require 'json'
puts JSON.parse("[\"\\u4e00\",\"\\u4e8c\"]")
Prints:
一
二
Unicode characters in JSON are escaped as backslash u followed by four hex digits. See the string production on json.org.
Any JSON parser will convert it to the correct representation for your platform (if it doesn't, then by definition it is not a JSON parser)

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 127 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.

Resources