A hint for end of ASCII data in a binary file - shell

I'm developing software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat such a file to inspect its contents, I'm thinking of putting an ASCII-compatible "magic string" at the start of the file that gives the name and version of the binary format.
I'm thinking of having at least ten lines (\n) in the message so that head with its default settings doesn't hit the binary part.
Now, I wonder whether there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text and should simply be ignored. I tried 0x00 (the null byte) and 0x04 (Ctrl-D), but they seem to be ignored when catting the file.

cat regards every file as text. There is no way you can trigger an end-of-file, since EOF is not actually a character.
The other way around works, of course: specify a format that only starts reading the binary data from a certain marker onwards.
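A minimal sketch of that approach in Python (the format name, version line, and payload below are made up for illustration): write an ASCII banner spanning more than ten lines, then append the binary payload; a reader skips the fixed banner before parsing.

# Sketch: an ASCII banner of more than ten lines (so `head` with default
# settings stops before the binary part), followed by the binary payload.
banner = (
    "MYFORMAT binary data file, version 1.0\n"
    "Everything after this banner is binary and not human-readable.\n"
    + "\n" * 9  # pad the banner to eleven lines in total
)
payload = bytes(range(256))  # placeholder for the real binary data

with open("example.myformat", "wb") as f:
    f.write(banner.encode("ascii"))
    f.write(payload)

# A reader would open the file in binary mode, consume len(banner) bytes
# (or read lines until the banner ends), and parse the rest as binary.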

Related

How can I create a file with null bytes in the filename?

For a security test, I need to pass a file that contains null characters in its content and its filename.
For the body content, it's easy to use printf:
$ printf "Hello\00, Null!" > containsnull.txt
$ xxd containsnull.txt
0000000: 4865 6c6c 6f00 2c20 4e75 6c6c 21 Hello., Null!
But how can I create a file with the null bytes in the name?
Note: A solution in bash, python or nodejs is preferred if possible
It's impossible to create a file name containing a null byte through POSIX or Windows APIs. On all the Unix systems that I'm aware of, it's impossible to create a file name containing a null byte, even with an ill-behaved application that bypasses the normal API, because the kernel itself treats all of its file name inputs as null-terminated strings. I believe this is true on Windows as well but I'm not completely sure.
As an application programmer, in terms of security, this means that you don't need to worry about a file name containing null bytes, if you're sure that what you have is the name of a file. On the other hand, if you're given a string and told to use it as a file name, for example if you're programming a server and letting the client choose file names, you need to ensure that this string does not contain null bytes. That's just one requirement among others including the string length, the presence of a directory separator (/ or \), reserved names (. and .., reserved Windows file names such as nul.txt or prn), etc. On most Unix systems, on their native filesystem, the constraints for a file name are: no null byte or slash, length between 1 and some maximum, and the two names . and .. are reserved. Windows, and non-native filesystems on Unix, have additional constraints (it's possible to put a / in a file name through direct kernel calls on Windows).
To put a null byte in a file's contents, just write a string to a file using any language that allows null bytes in strings. In bash, you cannot store a null byte in a string, so you need to use another method such as printf '\0' or echo "abc" | tr b '\0'.
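A quick Python illustration of both halves of this (the file names are arbitrary): a null byte in the contents is written like any other byte, while a null byte in the name is rejected before the call ever reaches the kernel.

# Null bytes in a file's *contents* are unremarkable: write in binary mode.
with open("containsnull.txt", "wb") as f:
    f.write(b"Hello\x00, Null!")

# Null bytes in a file's *name* never reach the kernel; CPython rejects
# the path up front with "ValueError: embedded null byte".
try:
    open("bad\x00name.txt", "w")
except ValueError as err:
    print("rejected:", err)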
You don't have to worry about file names containing null bytes on Unixes and Windows because they cannot.
However, a file name that is being treated as UTF-8 can specify the NUL character (U+0000) using invalid "overlong" sequences: two-, three- or four-byte UTF-8 sequences that have all zeros in their code point payload bits.
This can be a security issue. For instance, a UTF-8 decoder that doesn't check for this can end up generating a wchar_t character value of 0, which then unexpectedly terminates the wide character string.
For instance, the byte sequence C0 80 is an overlong encoding for NUL. This is evidently used by something called "Modified UTF-8" specifically for the purpose of encoding NUL characters that don't terminate the C string being used to hold the UTF-8.
If you're doing security testing, this is relevant; you can test whether programs are susceptible to NUL character (and other) injection via overlong encodings.
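A small Python demonstration of the difference (the naive decoder below is a deliberately simplified two-byte-only sketch, not any real library): a strict decoder rejects the overlong sequence C0 80, while a decoder that skips the overlong check happily produces U+0000.

data = b"\xc0\x80"  # overlong two-byte encoding of U+0000 ("Modified UTF-8" style)

# A strict decoder refuses it:
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decoder:", err)

# A naive decoder that only masks out the payload bits, without checking
# for overlong forms, silently yields NUL -- exactly the injection risk
# described above.
def naive_two_byte(b):
    assert b[0] & 0xE0 == 0xC0 and b[1] & 0xC0 == 0x80
    return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))

print(repr(naive_two_byte(data)))  # '\x00'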
Try $'\u000d'
Not actually a null byte, but probably close enough to confuse people, since you have to look really closely to see that the last character is a D and not a 0: it will usually print (if not just as a blank) as the little box with the hex code in it.
I discovered this when I found a directory in my $HOME named that...

Unpacking COMP-3 digit using Record Editor/Jrecord

I have created a layout based on a COBOL copybook.
Layout snapshot:
I tried to load the data selecting the same layout, but it gives me wrong results for some columns. I tried using all the binary numeric types.
CLASS-ORDER-EDGE
DIV-NO-EDG
OFFICE-NO-EDG
REG-AREA-NO-EDG
CITY-NO-EDG
COUNTY-NO-EDG
BILS-COUNT-EDG
REV-AMOUNT-EDG
USAGE-QTY-EDG
GAS-CCF-EDG
Result snapshot:
The input file can be found at the link below:
https://drive.google.com/open?id=0B-whK3DXBRIGa0I0aE5SUHdMTDg
Expected output:
Related thread
Unpacking COMP-3 digit using Java
The first problem: you have done an EBCDIC --> ASCII conversion on the file!
The EBCDIC --> ASCII conversion will also try to convert binary fields as well as text.
For example:
Comp-3 value   Hex       Hex after ASCII conversion
400            x'400c'   x'200c'
x'40' is the EBCDIC space character; it gets converted to the ASCII space character x'20'.
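A small Python sketch (just an illustration, not part of JRecord) of why this is destructive: unpacking the same COMP-3 field before and after the x'40' --> x'20' byte substitution yields two different numbers.

# COMP-3 (packed decimal): decimal digits in nibbles, sign in the last nibble.
def unpack_comp3(raw):
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    value = int("".join(str(d) for d in digits))
    return -value if sign == 0x0D else value

print(unpack_comp3(b"\x40\x0c"))  # 400 -- the original EBCDIC bytes
print(unpack_comp3(b"\x20\x0c"))  # 200 -- after x'40' was "translated" to x'20'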
You need to do a binary transfer, keeping the file as EBCDIC:
Check the file on the mainframe: if it has RECFM=FB, you can do a straight binary transfer.
If the file is RECFM=VB, make sure you transfer the RDW (Record Descriptor Word), or copy the VB file to an FB file on the mainframe.
Other points:
You will have to update RecordEditor/JRecord
The font will need to be EBCDIC (cp037 for US EBCDIC; for others, look up the appropriate code page)
The FileStructure/FileOrganisation needs to change (Fixed length / VB)
Finally
BILS-Count-EDG is either 9 characters long or starts in column 85 (and is 8 bytes long).
You should include the XML as text, not paste in a picture.
In the RecordEditor, if you right-click >>> Edit Record, it will show the fields as Value, Raw Text and Hex. That is useful for seeing what is going on.
You do not seem to accept many answers; what matters is not whether the answer solves your problem, but whether it is a correct answer to the question.

How do I create SAS fixed-format output containing end-of-line control characters?

I am using SAS's FILE statement to output a text file having fixed format (RECFM=F).
I would like each row to end in end-of-line control characters such as a linefeed/carriage return. I tried the FILE statement's TERMSTR=CRLF option, but I still see no end-of-line control characters in the output file.
I think I could use the PUT statement to insert the desired linefeed and carriage return control characters, but would prefer a cleaner method. Is it a reasonable thing to expect of the FILE statement? (Is it a reasonable expectation for outputting fixed format data?)
(Platform: Windows v6.1.7600, SAS for Windows v9.2 TS Level 2M3 W32_VSPRO platform)
Do you really need to use RECFM=F? You can still get fixed length output with V:
data _null_;
  file 'c:\temp\test.txt' lrecl=12 recfm=V;
  do i=1 to 5;
    x=rannor(123);
    put #1 i #4 x 6.4;
  end;
run;
By specifying where you want the data to go (#1 and #4) and the format (6.4), along with lrecl, you will get fixed-length output.
There may be a work-around, but I believe SAS won't output a line-ending with the Fixed format.

Parsing out abnormal characters

I have to work with text that was previously copied and pasted from an Excel document into a .txt file. There are a few characters that I assume mean something to Excel but that show up as an unrecognised character (i.e. the '?' symbol in gedit, or one of those rectangles in some other text editors). I want to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I go about doing this?
You could perhaps work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters; that means your '?'s and triangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend you first export the data to CSV using (Open|Libre)Office, choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi-byte sequences, I find this regex handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
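For the same task in Python (the file name and placeholder character are arbitrary), a sketch that decodes leniently and then replaces anything non-printable, so both invalid byte sequences and stray control characters become visible as a single placeholder:

with open("pasted.txt", "rb") as f:
    raw = f.read()

# Invalid byte sequences become U+FFFD instead of raising an error.
text = raw.decode("utf-8", errors="replace")

# Replace U+FFFD and any non-printable character (except newline/tab)
# with a visible placeholder.
cleaned = "".join(
    ch if (ch.isprintable() and ch != "\ufffd") or ch in "\n\t" else "*"
    for ch in text
)
print(cleaned)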

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose the "Unit Separator" (US): ASCII 31 (0x1F).
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically "US" in a single glyph), but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
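If the data itself can never contain these control codes, using them is straightforward; a minimal Python sketch with made-up field values:

US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between records

records = [
    ["Alice", "42", "Engineering"],
    ["Bob", "7", "Sales"],
]

# Encode: fields joined by US, records joined by RS.
blob = RS.join(US.join(fields) for fields in records)

# Decode.
decoded = [record.split(US) for record in blob.split(RS)]
assert decoded == records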
Assuming for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice, get a bigger data set. It won't take much time to write, and you'll get the answer that is best for you.
The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in mathematical formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
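A throwaway Python sketch of that survey (the sample file name is arbitrary): count how often each byte value 0-127 occurs in representative data, then list the printable ones that never appear.

from collections import Counter

with open("sample_data.txt", "rb") as f:
    counts = Counter(f.read())

# Byte values 0-127 that never occur in the sample are delimiter candidates.
unused = [b for b in range(128) if counts[b] == 0]
print("candidate delimiters:", [chr(b) for b in unused if chr(b).isprintable()])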
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~; you could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough, the ASCII table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatenate str1, str2 and str3
what I do is:
delimitedStr = str1.Replace("#", "#a").Replace("|", "#p") + "|"
             + str2.Replace("#", "#a").Replace("|", "#p") + "|"
             + str3.Replace("#", "#a").Replace("|", "#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
Note: the order of the replaces is important.
It's unbreakable and easy to implement.
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split the strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
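A small Python sketch of the Base64 approach (the field values are invented): the Base64 alphabet is A-Z, a-z, 0-9, +, / and =, so any character outside it, such as a comma, can separate the encoded items safely.

import base64

items = ["first, with a comma", "second | with a pipe", "third\nwith a newline"]

# Encode each item, then join with a character that can never appear in Base64 output.
packed = ",".join(base64.b64encode(s.encode("utf-8")).decode("ascii") for s in items)

# Decode.
unpacked = [base64.b64decode(chunk).decode("utf-8") for chunk in packed.split(",")]
assert unpacked == items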
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable character works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used: you don't even have to read the whole file, you're literally pulling from it by reference. This is how databases do some of their storage, but they also manage the spaces between records and so on, and it introduces the problem of a maximum data-element width. (In the old days, an index attached a header defining the width and data type of each element; later, compression with character remapping was introduced, which lets a text file shrink to about 1/8 of its size in transmission. Variable-length character encoding for the win.)
Make it dynamic :)
Announce your control characters in the file header.
For example:
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
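A rough Python sketch of one way such a self-describing container could be read (the parsing rules are inferred from the example above, not from any published spec): strip the wrap character from full-width lines, rejoin the payload, then split on unescaped delimiters and drop the escapes.

def parse(container):
    lines = container.splitlines()
    header, body = {}, []
    for i, line in enumerate(lines):
        if ":" not in line:  # first line without "key: value" starts the payload
            body = lines[i:]
            break
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()

    delim, esc = header["delimiter"], header["escape"]
    wrap, width = header["wrapline"], int(header["width"])

    # Undo the fixed-width wrapping: full-width lines ending in the wrap
    # character continue on the next line.
    payload = "".join(
        line[:-1] if len(line) == width and line.endswith(wrap) else line
        for line in body
    )

    # Split on unescaped delimiters, taking escaped characters literally.
    items, current, i = [], "", 0
    while i < len(payload):
        ch = payload[i]
        if ch == esc and i + 1 < len(payload):
            current += payload[i + 1]
            i += 2
        elif ch == delim:
            items.append(current)
            current = ""
            i += 1
        else:
            current += ch
            i += 1
    items.append(current)
    return items

Run against the example container above, this returns the four strings listed.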
I have implemented something similar: a "plaintar" text container format to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages. See https://github.com/milahu/live-diff-html-editor
