Find and replace (increment) ASN.1 BER hex value - bash

I have a long string of hex (converted from BER ASN.1) where I need to find and increment a particular value which is incorrect.
<TAG> <LENGTH> <VALUE to INCREMENT>
the ASN.1 tag is 84 and the length byte will change from 01 to 02 when the value > 127dec. And the value to increment will therefore become 2 bytes.
The value should start at 00.
e.g.
- Original file: ...840101...840107...84020085...84020097
- New file: ...840100...840101...84020080...84020081
Any ideas how best to do this, preferably using standard bash commands?

Ilya Etingof hinted at this already, but to be explicit about it, BER uses TAG, LENGTH, VALUE (TLV) encoding, where the VALUE can itself be a TLV. If you change the length in a TLV that is nested inside a TLV, you will need to update all of the lengths of the enclosing TLVs as well. It is not a simple search/replace operation.

Assuming you have the octet-stream in text form already, you may consider searching and replacing pieces of text with awk or sed. If you can only use bash, maybe parameter expansion (${parameter/pattern/string} or ${parameter:offset:length}) would work?
Keep in mind however, that BER is quite flexible in the sense that (sometimes) the same data structure may be encoded differently and that would still constitute a valid encoding. The rationale behind that is to allow the encoder to optimize for its very own situation (e.g. save on memory or CPU cycles or on copying etc).
What I am trying to say is that, depending on your situation, there is a chance that your search/replace logic will fail. The bullet-proof solution would be to fully decode your BER octet stream, change the data structure you need and re-encode it back into BER.
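For the simple flat case in the question, the renumbering could be sketched in Python roughly as below. This is only a naive sketch under loud assumptions: the function names `encode_int` and `renumber` are made up for illustration, every byte-aligned `84 01 xx` or `84 02 xx xx` sequence is assumed to be one of the target TLVs (a flat scan cannot tell a tag from value bytes that happen to look like one), and the lengths of any enclosing TLVs are NOT updated when a value grows from one byte to two.

```python
def encode_int(n: int) -> bytes:
    """Minimal big-endian body for a non-negative integer, keeping the top bit clear."""
    length = max(1, (n.bit_length() + 8) // 8)  # one spare bit for the sign
    return n.to_bytes(length, "big")


def renumber(hex_text: str, start: int = 0) -> str:
    """Replace each 84 <len 1|2> <value> with a running counter (naive flat scan)."""
    data = bytes.fromhex(hex_text)
    out = bytearray()
    i, counter = 0, start
    while i < len(data):
        is_target = (
            data[i] == 0x84
            and i + 1 < len(data)
            and data[i + 1] in (1, 2)
            and i + 2 + data[i + 1] <= len(data)
        )
        if is_target:
            body = encode_int(counter)
            out += bytes([0x84, len(body)]) + body
            i += 2 + data[i + 1]  # skip tag, length and the old value
            counter += 1
        else:
            out.append(data[i])  # copy everything else unchanged
            i += 1
    return out.hex().upper()


print(renumber("AA840101BB840107"))  # AA840100BB840101
```

If the 84 elements sit inside other TLVs, or other values can contain the byte 0x84, a real BER decoder is the safer route, as the answers above say.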

Related

JavaScript alternative to Ruby's force_encoding(Encoding::ASCII_8BIT)

I'm making my way through the Building Git book, but attempting to build my implementation in JavaScript. I'm stumped at the part of reading data in this file format that apparently only Ruby uses. Here is the excerpt from the book about this:
Note that we set the string’s encoding13 to ASCII_8BIT, which is Ruby’s way of saying that the string represents arbitrary binary data rather than text per se. Although the blobs we’ll be storing will all be ASCII-compatible source code, Git does allow blobs to be any kind of file, and certainly other kinds of objects — especially trees — will contain non-textual data. Setting the encoding this way means we don’t get surprising errors when the string is concatenated with others; Ruby sees that it’s binary data and just concatenates the bytes and won’t try to perform any character conversions.
Is there a way to emulate this encoding in JS?
or
Is there an alternative encoding I can use that JS and Ruby share that won't break anything?
Additionally, I've tried using Buffer.from(< text input >, 'binary') but it doesn't result in the same amount of bytes that the ruby ASCII-8BIT returns because in Node.js binary maps to ISO-8859-1.
Node certainly supports binary data, that's kind of what Buffer is for. However, it is crucial to know what you are converting into what. For example, the emoji "☺️" is encoded as six bytes in UTF-8:
// UTF-16 (JS) string to UTF-8 representation
Buffer.from('☺️', 'utf-8')
// => <Buffer e2 98 ba ef b8 8f>
If you happen to have a string that is not a native JS string (i.e. its encoding is different), you can use the encoding parameter to make Buffer interpret each character in a different manner (though only several different conversions are supported). For example, if we have a string of six characters that correspond to the six numbers above, it is not a smiley face for JavaScript, but Buffer.from can help us repackage it:
Buffer.from('\u00e2\u0098\u00ba\u00ef\u00b8\u008f', 'binary')
// => <Buffer e2 98 ba ef b8 8f>
JavaScript itself has only one encoding for its strings; thus, the parameter 'binary' is not really a binary encoding, but a mode of operation for Buffer.from, telling it that the string would have been a binary string if each character were one byte (however, since JavaScript strings are sequences of 16-bit UTF-16 code units, each character always takes at least two bytes). Thus, if you use it on something that is not a string of characters in the range from U+0000 to U+00FF, it will not do the correct thing, because there is no such thing (GIGO principle). What it will actually do is take the lower byte of each character, which is probably not what you want:
Buffer.from('STUFF', 'binary') // 8BIT range: U+0000 to U+00FF
// => <Buffer 53 54 55 46 46> ("STUFF")
Buffer.from('\uff33\uff34\uff35\uff26\uff26', 'binary') // fullwidth "ＳＴＵＦＦ": U+FF33 U+FF34 U+FF35 U+FF26 U+FF26
// => <Buffer 33 34 35 26 26> (garbage)
So, Node's Buffer structure exactly corresponds to Ruby's ASCII-8BIT "encoding" (binary is an encoding like "bald" is a hair style — it simply means no interpretation is attached to bytes; e.g. in ASCII, 65 means "A"; but in binary "encoding", 65 is just 65). Buffer.from with 'binary' lets you convert weird strings where one character corresponds to one byte into a Buffer. It is not the normal way of handling binary data; its function is to un-mess-up binary data when it has been read incorrectly into a string.
I assume you are reading a file as string, then trying to convert it to a Buffer — but your string is not actually in what Node considers to be the "binary" form (a sequence of characters in range from U+0000 to U+00FF; thus "in Node.js binary maps to ISO-8859-1" is not really true, because ISO-8859-1 is a sequence of characters in range from 0x00 to 0xFF — a single-byte encoding!).
Ideally, to have a binary representation of file contents, you would want to read the file as a Buffer in the first place (by using fs.readFile without an encoding), without ever touching a string.
(If my guess here is incorrect, please specify what the contents of your < text input > is, and how you obtain it, and in which case "it doesn't result in the same amount of bytes".)
EDIT: I seem to like typing Array.from too much. It's Buffer.from, of course.

Can I alpha sort base32/64 encoded MD5 hashes?

I've got a massive file of hex encoded MD5 values that I'm using linux 'sort' utility to sort. The result is that the hashes come out in sequential order (which is what I need for the next stage of processing). E.g:
000001C35AE83CEFE245D255FFC4CE11
000003E4B110FE637E0B4172B386ACAC
000004AAD0EB3D896B654A960B0111FA
In the interest of speeding up the sort operation (and making the files smaller), I was considering encoding the data as base32 or base64.
The question is, would an alpha-sort of the base32/64 data get me the same result? My quick tests seem to indicate that it would work. For example, the above three hex strings correspond 1:1 to these base64 strings:
AAABw1roPO/iRdJV/8TOEQ==
AAAD5LEQ/mN+C0Fys4asrA==
AAAEqtDrPYlrZUqWCwER+g==
But I'm unsure as to the sort order when it comes to special characters used in Base64 like "/" and "+" and how those would be treated in the context of an alpha sort.
Note: I happen to be using the linux sort utility but the question still applies to other alpha-sorting tools. The tool used is not really part of the question.
I've since discovered that this isn't possible with the standard base32/64 implementations. There exists however a base32 variation called "base32hex" which preserves sort ordering, but there is no official "base64hex" equivalent.
Looks like that leaves creating a custom encoding like this.
EDIT:
This turned out to be trivial to solve. Simply encode in base64, then translate character by character using a custom alphabet whose characters are in ascending sort order.
That is, map from the standard MIME base64 characters:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
To something like this:
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
Then sorting will work.
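A minimal Python sketch of that translation, assuming the two alphabets quoted above; the names `STD`, `SORTABLE` and `sortable_b64` are only illustrative. The padding character = is left untouched, which should be harmless here because every MD5 value encodes to the same length.

```python
import base64

STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SORTABLE = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
TO_SORTABLE = str.maketrans(STD, SORTABLE)


def sortable_b64(hex_digest: str) -> str:
    """Base64-encode a hex digest, then remap it to the order-preserving alphabet."""
    raw = bytes.fromhex(hex_digest)
    return base64.b64encode(raw).decode("ascii").translate(TO_SORTABLE)


hashes = [
    "000004AAD0EB3D896B654A960B0111FA",
    "000001C35AE83CEFE245D255FFC4CE11",
    "000003E4B110FE637E0B4172B386ACAC",
]
by_hex = sorted(hashes)
by_encoded = sorted(hashes, key=sortable_b64)
print(by_hex == by_encoded)
```

With the sort utility, the comparison presumably has to be byte-wise (for example by setting LC_ALL=C), since locale-aware collation may order | and ~ differently.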

UTF-8 string delimiter

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. Treating the short as unsigned, this gives a maximum string length of 2^16 - 1 = 65,535 bytes, which is more than adequate for the particular application.
My question is, is this a standard way of delimiting UTF-8 strings?
I wouldn't call that delimiting, more like "length prefixing". Some people call them Pascal strings since in the early days the language Pascal was one of the popular ones that stored strings that way in memory.
I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
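As a rough illustration of length prefixing, here is a Python sketch. The byte order of the two-byte length is not stated in the question, so big-endian and unsigned (`">H"`) is an assumption here, and the helper names `write_prefixed_string` and `read_prefixed_string` are hypothetical.

```python
import struct


def write_prefixed_string(text: str) -> bytes:
    """Encode text as UTF-8, preceded by a 2-byte unsigned big-endian length (assumed)."""
    encoded = text.encode("utf-8")
    if len(encoded) > 0xFFFF:
        raise ValueError("string too long for a 2-byte length prefix")
    return struct.pack(">H", len(encoded)) + encoded


def read_prefixed_string(buf: bytes, offset: int = 0):
    """Return (decoded string, offset of the byte following it)."""
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("buffer ends before the declared string length")
    return buf[start:end].decode("utf-8"), end


packet = b"\x01\x02" + write_prefixed_string("héllo") + b"\xff"
text, next_offset = read_prefixed_string(packet, offset=2)
print(text, next_offset)
```

Note that the prefix counts bytes, not characters: "héllo" is five characters but six bytes in UTF-8.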
UTF8 is not normally de-limited, you should be able to spot the multibyte characters in there by using the rules mentioned here: http://en.wikipedia.org/wiki/UTF-8#Description
I would use a delimiter which starts with 0x11......
But if you send raw bytes, you will have to exclude this delimiter from the data/messages processed. This means that if there is user input similar to that delimiter, you will have to convert it.
If the user inputs any UTF-8 represented char, you may simply send it as is.

Why do we use Base64?

Wikipedia says
Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport.
But is it not that data is always stored/transmitted in binary because the memory that our machines have store binary and it just depends how you interpret it? So, whether you encode the bit pattern 010011010110000101101110 as Man in ASCII or as TWFu in Base64, you are eventually going to store the same bit pattern.
If the ultimate encoding is in terms of zeros and ones and every machine and media can deal with them, how does it matter if the data is represented as ASCII or Base64?
What does it mean "media that are designed to deal with textual data"? They can deal with binary => they can deal with anything.
Thanks everyone, I think I understand now.
When we send over data, we cannot be sure that the data would be interpreted in the same format as we intended it to be. So, we send over data coded in some format (like Base64) that both parties understand. That way even if sender and receiver interpret same things differently, but because they agree on the coded format, the data will not get interpreted wrongly.
From Mark Byers example
If I want to send
Hello
world!
One way is to send it in ASCII like
72 101 108 108 111 10 119 111 114 108 100 33
But byte 10 might not be interpreted correctly as a newline at the other end. So, we use a subset of ASCII to encode it like this
83 71 86 115 98 71 56 75 100 50 57 121 98 71 81 104
which at the cost of more data transferred for the same amount of information ensures that the receiver can decode the data in the intended way, even if the receiver happens to have different interpretations for the rest of the character set.
Your first mistake is thinking that ASCII encoding and Base64 encoding are interchangeable. They are not. They are used for different purposes.
When you encode text in ASCII, you start with a text string and convert it to a sequence of bytes.
When you encode data in Base64, you start with a sequence of bytes and convert it to a text string.
To understand why Base64 was necessary in the first place we need a little history of computing.
Computers communicate in binary - 0s and 1s - but people typically want to communicate with richer forms of data such as text or images. In order to transfer this data between computers it first has to be encoded into 0s and 1s, sent, then decoded again. To take text as an example - there are many different ways to perform this encoding. It would be much simpler if we could all agree on a single encoding, but sadly this is not the case.
Originally a lot of different encodings were created (e.g. Baudot code) which used a different number of bits per character until eventually ASCII became a standard with 7 bits per character. However most computers store binary data in bytes consisting of 8 bits each so ASCII is unsuitable for transferring this type of data. Some systems would even wipe the most significant bit. Furthermore the difference in line ending encodings across systems means that the ASCII characters 10 and 13 were also sometimes modified.
To solve these problems Base64 encoding was introduced. This allows you to encode arbitrary bytes to bytes which are known to be safe to send without getting corrupted (ASCII alphanumeric characters and a couple of symbols). The disadvantage is that encoding the message using Base64 increases its length - every 3 bytes of data is encoded to 4 ASCII characters.
To send text reliably you can first encode to bytes using a text encoding of your choice (for example UTF-8) and then afterwards Base64 encode the resulting binary data into a text string that is safe to send encoded as ASCII. The receiver will have to reverse this process to recover the original message. This of course requires that the receiver knows which encodings were used, and this information often needs to be sent separately.
Historically it has been used to encode binary data in email messages where the email server might modify line-endings. A more modern example is the use of Base64 encoding to embed image data directly in HTML source code. Here it is necessary to encode the data to avoid characters like '<' and '>' being interpreted as tags.
Here is a working example:
I wish to send a text message with two lines:
Hello
world!
If I send it as ASCII (or UTF-8) it will look like this:
72 101 108 108 111 10 119 111 114 108 100 33
The byte 10 is corrupted in some systems, so we can encode these bytes as a Base64 string:
SGVsbG8Kd29ybGQh
Which when encoded using ASCII looks like this:
83 71 86 115 98 71 56 75 100 50 57 121 98 71 81 104
All the bytes here are known safe bytes, so there is very little chance that any system will corrupt this message. I can send this instead of my original message and let the receiver reverse the process to recover the original message.
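The same worked example can be reproduced with Python's standard base64 module; this is just a sketch to check the numbers above, and the variable names are arbitrary.

```python
import base64

message = "Hello\nworld!"

raw = message.encode("ascii")        # text -> bytes
encoded = base64.b64encode(raw)      # bytes -> Base64 text (as ASCII bytes)
decoded = base64.b64decode(encoded)  # the receiver reverses the process

print(list(raw))       # byte values of the original message, including 10
print(encoded.decode("ascii"))
print(list(encoded))   # byte values actually sent
print(decoded.decode("ascii") == message)
```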
Encoding binary data in XML
Suppose you want to embed a couple images within an XML document. The images are binary data, while the XML document is text. But XML cannot handle embedded binary data. So how do you do it?
One option is to encode the images in base64, turning the binary data into text that XML can handle.
Instead of:
<images>
<image name="Sally">{binary gibberish that breaks XML parsers}</image>
<image name="Bobby">{binary gibberish that breaks XML parsers}</image>
</images>
you do:
<images>
<image name="Sally" encoding="base64">j23894uaiAJSD3234kljasjkSD...</image>
<image name="Bobby" encoding="base64">Ja3k23JKasil3452AsdfjlksKsasKD...</image>
</images>
And the XML parser will be able to parse the XML document correctly and extract the image data.
Why not look to the RFC that currently defines Base64?
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data. Base encoding can also be used in new applications that do not have legacy restrictions, simply because it makes it possible to manipulate objects with text editors.
In the past, different applications have had different requirements and thus sometimes implemented base encodings in slightly different ways. Today, protocol specifications sometimes use base encodings in general, and "base64" in particular, without a precise description or reference. Multipurpose Internet Mail Extensions (MIME) [4] is often used as a reference for base64 without considering the consequences for line-wrapping or non-alphabet characters. The purpose of this specification is to establish common alphabet and encoding considerations. This will hopefully reduce ambiguity in other documents, leading to better interoperability.
Base64 was originally devised as a way to allow binary data to be attached to emails as a part of the Multipurpose Internet Mail Extensions.
Media that is designed for textual data is of course eventually binary as well, but textual media often use certain binary values for control characters. Also, textual media may reject certain binary values as non-text.
Base64 encoding encodes binary data as values that can only be interpreted as text in textual media, and is free of any special characters and/or control characters, so that the data will be preserved across textual media as well.
It is more that the media validates the string encoding, so we want to ensure that the data is acceptable by a handling application (and doesn't contain a binary sequence representing EOL for example)
Imagine you want to send binary data in an email with encoding UTF-8 -- The email may not display correctly if the stream of ones and zeros creates a sequence which isn't valid Unicode in UTF-8 encoding.
The same type of thing happens in URLs when we want to encode characters not valid for a URL in the URL itself:
http://www.foo.com/hello my friend -> http://www.foo.com/hello%20my%20friend
This is because we want to send a space over a system that will think the space is smelly.
All we are doing is ensuring there is a 1-to-1 mapping between a known good, acceptable and non-detrimental sequence of bits to another literal sequence of bits, and that the handling application doesn't distinguish the encoding.
In your example, man may be valid ASCII in first form; but often you may want to transmit values that are random binary (ie sending an image in an email):
MIME-Version: 1.0
Content-Description: "Base64 encode of a.gif"
Content-Type: image/gif; name="a.gif"
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="a.gif"
Here we see that a GIF image is encoded in base64 as a chunk of an email. The email client reads the headers and decodes it. Because of the encoding, we can be sure the GIF doesn't contain anything that may be interpreted as protocol and we avoid inserting data that SMTP or POP may find significant.
Here is a summary of my understanding after reading what others have posted:
Important!
Base64 encoding is not meant to provide security
Base64 encoding is not meant to compress data
Why do we use Base64
Base64 is a text representation of data that uses an alphabet of only 64 characters: the alphanumeric characters (uppercase and lowercase letters and digits) plus + and /. A 65th character, =, is used only for padding at the end.
These 64 characters are considered ‘safe’, that is, they can not be misinterpreted by legacy computers and programs unlike characters such as <, > \n and many others.
When is Base64 useful
I've found base64 very useful when transferring files as text. You get the file's bytes and encode them to base64, transmit the base64 string, and on the receiving side you do the reverse.
This is the same procedure that is used when sending attachments over SMTP during emailing.
How to perform base64 encoding/decoding
Conversion from base64 text to bytes is called decoding.
Conversion from bytes to base64 text is called encoding. This is a bit different from how other encodings/decodings are named.
Dotnet and Powershell
Microsoft's .NET framework has support for encoding and decoding bytes to base64. Look for the Convert class in the System namespace (mscorlib library).
Below are PowerShell commands you can use:
# Base64 encode PowerShell
# See: https://adsecurity.org/?p=478
$Text='This is my nice cool text'
$Bytes = [System.Text.Encoding]::Unicode.GetBytes($Text)
$EncodedText = [Convert]::ToBase64String($Bytes)
$EncodedText
# Convert from base64 to plain text
[System.Text.Encoding]::Unicode.GetString([Convert]::FromBase64String('VABoAGkAcwAgAGkAcwAgAG0AeQAgAG4AaQBjAGUAIABjAG8AbwBsACAAdABlAHgAdAA='))
Output>This is my nice cool text
Bash itself has no built-in for base64, but most Linux systems ship a base64 command-line utility (part of GNU coreutils) for encoding/decoding. You can use it like this:
To encode to base64:
echo 'hello' | base64
To decode base64-encoded text to normal text:
echo 'aGVsbG8K' | base64 -d
Node.js also has support for base64. Here is a class that you can use:
/**
 * Attachment class.
 * Converts a base64 string to a file buffer and a file buffer to a base64 string.
 * Converting a Buffer to a string is known as decoding.
 * Converting a string to a Buffer is known as encoding.
 * See: https://nodejs.org/api/buffer.html
 *
 * For binary to text, the naming convention is reversed.
 * Converting Buffer to string is encoding.
 * Converting string to Buffer is decoding.
 */
class Attachment {
  /**
   * @param {string} base64Str
   * @returns {Buffer} file buffer
   */
  static base64ToBuffer(base64Str) {
    const fileBuffer = Buffer.from(base64Str, 'base64');
    // console.log(fileBuffer)
    return fileBuffer;
  }

  /**
   * @param {Buffer} fileBuffer
   * @returns {string} base64 encoded content
   */
  static bufferToBase64(fileBuffer) {
    const base64Encoded = fileBuffer.toString('base64');
    // console.log(base64Encoded)
    return base64Encoded;
  }
}
You get the file buffer like so:
const fileBuffer = fs.readFileSync(path);
Or like so:
const buf = Buffer.from('hey there');
You can also use an API to do the encoding and decoding for you; here is one:
To encode, you pass in the plain text as the body.
POST https://mk34rgwhnf.execute-api.ap-south-1.amazonaws.com/base64-encode
To decode, pass in the base64 string as the body.
POST https://mk34rgwhnf.execute-api.ap-south-1.amazonaws.com/base64-decode
Fantasy example of when you might need base64
Here is a far fetched scenario of when you might need to use base64.
Suppose you are a spy and you're on a mission to copy and take back a picture of great value back to your country's intelligence.
This picture is on a computer that has no access to internet and no printer. All you have in your hands is a pen and a single sheet of paper. No flash disk, no CD etc. What do you do?
Your first option would be to convert the picture into binary 1s and 0s , copy those 1s and 0s to the paper one by one and then run for it.
However, this can be a challenge because representing a picture using only 1s and 0s as your alphabet will result in very many 1s and 0s. Your paper is small and you don't have time. Plus, the more 1s and 0s, the more chances of error.
Your second option is to use hexadecimal instead of binary. Hexadecimal allows for 16 instead of 2 possible characters so you have a wider alphabet hence less paper and time required.
Still a better option is to convert the picture into base64 and take advantage of yet another larger character set to represent the data. Less paper and less time to complete. There you go!
Base64 instead of escaping special characters
I'll give you a very different but real example: I write javascript code to be run in a browser. HTML tags have ID values, but there are constraints on what characters are valid in an ID.
But I want my ID to losslessly refer to files in my file system. Files in reality can have all manner of weird and wonderful characters in them from exclamation marks, accented characters, tilde, even emoji! I cannot do this:
<div id="/path/to/my_strangely_named_file!#().jpg">
<img src="http://myserver.com/path/to/my_strangely_named_file!#().jpg">
Here's a pic I took in Moscow.
</div>
Suppose I want to run some code like this:
// ERROR
document.getElementById("/path/to/my_strangely_named_file!#().jpg");
I think this code will fail when executed.
With Base64 I can refer to something complicated without worrying about which language allows what special characters and which need escaping:
document.getElementById("18GerPD8fY4iTbNpC9hHNXNHyrDMampPLA");
Unlike using MD5 or some other hashing function, you can reverse the encoding to find out exactly what the original data was, which is actually useful.
I wish I knew about Base64 years ago. I would have avoided tearing my hair out with 'encodeURIComponent' and str.replace('\n','\\n')
SSH transfer of text:
If you're trying to pass complex data over ssh (e.g. a dotfile so you can get your shell personalizations), good luck doing it without Base 64. This is how you would do it with base 64 (I know you can use SCP, but that would take multiple commands - which complicates key bindings for sshing into a server):
https://superuser.com/a/1376076/114723
One example of when I found it convenient was when trying to embed binary data in XML. Some of the binary data was being misinterpreted by the SAX parser because that data could be literally anything, including XML special characters. Base64 encoding the data on the transmitting end and decoding it on the receiving end fixed that problem.
Most computers store data in 8-bit binary format, but this is not a requirement. Some machines and transmission media can only handle 7 bits (or maybe even fewer) at a time. Such a medium would interpret the stream in multiples of 7 bits, so if you were to send 8-bit data, you won't receive what you expect on the other side. Base64 is just one way to solve this problem: you encode the input into a 6-bit format, send it over your medium and decode it back to 8-bit format at the receiving end.
In addition to the other (somewhat lengthy) answers: even ignoring old systems that support only 7-bit ASCII, basic problems with supplying binary data in text-mode are:
Newlines are typically transformed in text-mode.
One must be careful not to treat a NUL byte as the end of a text string, which is all too easy to do in any program with C lineage.
What does it mean "media that are designed to deal with textual data"?
That those protocols were designed to handle text (often, only English text) instead of binary data (like .png and .jpg images).
They can deal with binary => they can deal with anything.
But the converse is not true. A protocol designed to represent text may improperly treat binary data that happens to contain:
The bytes 0x0A and 0x0D, used for line endings, which differ by platform.
Other control characters like 0x00 (NULL = C string terminator), 0x03 (END OF TEXT), 0x04 (END OF TRANSMISSION), or 0x1A (DOS end-of-file) which may prematurely signal the end of data.
Bytes above 0x7F (if the protocol was designed for ASCII).
Byte sequences that are invalid UTF-8.
So you can't just send binary data over a text-based protocol. You're limited to the bytes that represent the non-space non-control ASCII characters, of which there are 94. The reason Base 64 was chosen was that it's faster to work with powers of two, and 64 is the largest one that works.
One question though. How is that systems still don't agree on a common encoding technique like the so common UTF-8?
On the Web, at least, they mostly have. A majority of sites use UTF-8.
The problem in the West is that there is a lot of old software that ass-u-me-s that 1 byte = 1 character and can't work with UTF-8.
The problem in the East is their attachment to encodings like GB2312 and Shift_JIS.
And the fact that Microsoft seems to have still not gotten over having picked the wrong UTF encoding. If you want to use the Windows API or the Microsoft C runtime library, you're limited to UTF-16 or the locale's "ANSI" encoding. This makes it painful to use UTF-8 because you have to convert all the time.
Why/ How do we use Base64 encoding?
Base64 is one of the binary-to-text encoding schemes, with 75% efficiency. It is used so that typical binary data (such as images) may be safely sent over legacy "not 8-bit clean" channels.
In earlier email networks (until the early 1990s), most email messages were plain text in the 7-bit US-ASCII character set. So many early communication protocol standards were designed to work over "7-bit" links that were "not 8-bit clean".
Scheme efficiency is the ratio between number of bits in the input and the number of bits in the encoded output.
Hexadecimal (Base16) is also a binary-to-text encoding scheme, with 50% efficiency.
Base64 Encoding Steps (Simplified):
Binary data is arranged in continuous chunks of 24 bits (3 bytes) each.
Each 24-bit chunk is split into four groups of 6 bits each.
Each 6-bit group is converted into its corresponding Base64 character, i.e. Base64 encoding converts three octets into four encoded characters. The ratio of output bytes to input bytes is 4:3 (33% overhead).
Interestingly, the same characters will be encoded differently depending on their position within the three-octet group which is encoded to produce the four characters.
The receiver will have to reverse this process to recover the original message.
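The simplified steps above can be sketched by hand in Python. This is an illustrative re-implementation rather than how any particular library does it; the name `b64_encode_manual` is made up, and padding with = follows the usual convention for inputs that are not a multiple of three bytes.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode_manual(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                       # 0, 1 or 2 missing bytes
        n = int.from_bytes(chunk + b"\x00" * pad, "big")  # one 24-bit number
        # Split the 24 bits into four 6-bit groups and look each one up.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            chars[-pad:] = "=" * pad               # mark the missing bytes
        out.extend(chars)
    return "".join(out)


print(b64_encode_manual(b"Man"))
```

For the three bytes of "Man" (bit pattern 010011010110000101101110) the four 6-bit groups are 19, 22, 5 and 46, which index the characters T, W, F and u.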
What does it mean "media that are designed to deal with textual data"?
Back in the day when ASCII ruled the world dealing with non-ASCII values was a headache. People jumped through all sorts of hoops to get these transferred over the wire without losing out information.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
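A small Python sketch of using the Unit Separator this way; the helper names are hypothetical, and rejecting input that already contains the separator is just one possible policy (escaping would be another).

```python
US = "\x1f"  # ASCII 31, Unit Separator


def join_fields(fields):
    """Join text items with the Unit Separator, refusing items that contain it."""
    for field in fields:
        if US in field:
            raise ValueError("field already contains the unit separator")
    return US.join(fields)


def split_fields(packed):
    """Split a packed string back into its items."""
    return packed.split(US)


packed = join_fields(["first item", "second, with | and ^", ""])
print(split_fields(packed))
```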
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
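The "go with the data" suggestion could look something like this in Python. It is a sketch under assumptions: the function name `delimiter_candidates` is invented, and only printable, non-alphanumeric ASCII characters (codes 33 to 126) are considered as candidates.

```python
from collections import Counter


def delimiter_candidates(samples):
    """Return (char, count) pairs for printable ASCII punctuation, rarest first."""
    seen = Counter()
    for text in samples:
        seen.update(text)
    candidates = [c for c in map(chr, range(33, 127)) if not c.isalnum()]
    return sorted(((c, seen[c]) for c in candidates), key=lambda pair: (pair[1], pair[0]))


samples = ["ls -l | grep foo", "x^2 + y^2", "hello, world!"]
ranked = delimiter_candidates(samples)
unused = [c for c, count in ranked if count == 0]
print(unused)
```

Characters with a count of zero in a representative sample are the natural picks, though an escaping scheme is still needed for the day one of them does show up.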
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatenate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the Replace calls is important
it's unbreakable and easy to implement
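For comparison, the same escaping idea written as a Python sketch that handles any number of items; the function names are made up here, while the "#a" / "#p" escape codes are the ones from the snippet above.

```python
def escape(item: str) -> str:
    # '#' must be escaped first, otherwise the '#' introduced by "#p" would be escaped again.
    return item.replace("#", "#a").replace("|", "#p")


def unescape(item: str) -> str:
    # Reverse order: restore '|' first, then '#'.
    return item.replace("#p", "|").replace("#a", "#")


def pack(items):
    return "|".join(escape(item) for item in items)


def unpack(packed):
    return [unescape(part) for part in packed.split("|")]


items = ["a|b", "#p", "", "x#|y"]
packed = pack(items)
print(packed)
print(unpack(packed) == items)
```

It works because, after escaping, every # in the string is the start of a two-character escape code and no literal | remains, so the split and the reverse replacements cannot be confused by the data.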
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable char works if you're not creating or modifying the file by hand. For quick random-access file storage and retrieval, fixed field widths are used instead. You don't even have to read the whole file; you're literally pulling from the file by offset. This is how databases do some storage, but they also manage the spaces between records and such. It also introduces the problem of a maximum data element width. (In the old days, indexes attached a header that defined the width and data type of each element. Later, compression with character remapping was introduced, which allows a text file to shrink to about 1/8 of its size in transmission.) Variable-length char encoding for the win.
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
