I read that a tar entry type of 'L' (ASCII 76) is used by GNU tar and GNU-compatible tar utilities to indicate that the next entry in the archive has a "long" name. In this case the header block with the entry type of 'L' usually encodes the name ././@LongLink.
My question is: where is the format of the next block described?
The format of a tar archive is very simple: it is just a series of 512-byte blocks. In the normal case, each file in a tar archive is represented as a series of blocks. The first block is a header block, containing the file name, entry type, modified time, and other metadata. Then the raw file data follows, using as many 512-byte blocks as required. Then the next entry.
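For reference, the key fields can be pulled out of a single 512-byte header block with a few lines of Python; this is a minimal, illustrative sketch using the well-known ustar offsets (name at bytes 0-99, octal size at bytes 124-135, typeflag at byte 156):

def parse_header(block):
    # block is one 512-byte tar header; fields are NUL-padded ASCII,
    # and the size field is an octal number.
    name = block[0:100].rstrip(b'\0').decode('utf-8', 'replace')
    size = int(block[124:136].rstrip(b' \0') or b'0', 8)
    typeflag = chr(block[156])
    return name, size, typeflag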
If the filename is longer than will fit in the space allocated in the header block, GNU tar apparently uses what's known as "the ././@LongLink trick". I can't find a precise description of it.
When the entry type is 'L', how do I know how long the "long" filename is? Is the long name limited to 512 bytes, in other words, whatever fits in one block?
Most importantly: where is this documented?
Just by observation of a single archive, here's what I surmised about the 'L' entry type in tar archives and the "././@LongLink" name:
The 'L' entry is present in a header for a series of 1 or more 512-byte blocks that hold just the filename for a file or directory with a name over 100 chars. For example, if the filename is 1200 chars long, then the size in the header block will be 1200, and there will be 3 additional blocks with filename data; the last block is partially filled.
Following that series is another header block, in the traditional form - a header with type '0' (regular file) or '5' (directory), followed by the appropriate number of data blocks with the entry data. In the header for this series, the name will be truncated to the first 100 characters of the actual name.
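Assuming that reading of the format is right (it matches what I observed, but it is a sketch, not a specification), resolving an 'L' entry in Python looks like this, reusing the parse_header helper sketched earlier:

def read_entry(f):
    # Return (name, typeflag, data) for the next entry, transparently
    # resolving a GNU 'L' long-name pseudo-entry if one is present.
    name, size, typeflag = parse_header(f.read(512))
    if typeflag == 'L':
        nblocks = (size + 511) // 512              # name data is padded to 512
        longname = f.read(nblocks * 512)[:size].rstrip(b'\0').decode('utf-8')
        name, size, typeflag = parse_header(f.read(512))   # the real header
        name = longname                            # replaces the truncated name
    data = f.read(((size + 511) // 512) * 512)[:size]
    return name, typeflag, data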
EDIT
See my implementation here:
http://cheesoexamples.codeplex.com/SourceControl/changeset/view/99885#1868643
Note that the information about all of that can be found in the libtar project:
http://www.feep.net/libtar/
The proposed header is libtar.h (as opposed to the POSIX tar.h), which clearly includes support for long filenames and long symbolic links.
You first get the "fake" header + data blocks for the long filename and/or long link, then the "real" header (whose name and link fields hold only the truncated values), followed by the actual file data:
HEADER type 'L'
BLOCKS of data with the real long filename
HEADER type 'K'
BLOCKS of data with the real symbolic link
HEADER type '0' (or '5' for directory, etc.)
BLOCKS of data with the actual file contents
Of course, under MS-Windows you probably won't handle symbolic links, although symbolic links under MS-Windows are said to work as of Win7 (and are now officially supported in Win10).
Pertinent definition from libtar.h:
/* GNU extensions for typeflag */
#define GNU_LONGNAME_TYPE 'L'
#define GNU_LONGLINK_TYPE 'K'
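If you just want to observe the trick rather than parse it by hand, Python's tarfile module implements the GNU extension and can generate a sample archive to inspect (illustrative sketch; note the placeholder name ././@LongLink in the first header):

import io, tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w', format=tarfile.GNU_FORMAT) as tf:
    tf.addfile(tarfile.TarInfo('x' * 150), io.BytesIO(b''))   # name > 100 chars

raw = buf.getvalue()
print(chr(raw[156]))          # 'L'  -> the long-name pseudo-header
print(raw[:13])               # b'././@LongLink' -> its placeholder name
print(chr(raw[1024 + 156]))   # '0'  -> the real header after the name block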
I was given a file that seems to be encoded in UTF-8, but every byte that should start with 1 starts with 0.
E.g. where one would expect the Polish letter 'ę', encoded in UTF-8 as the bytes 0xC4 0x99 (octal 304 231), there is 0x44 0x19 (octal 104 031). Or, in binary, there is 01000100 00011001 instead of 11000100 10011001.
I assume that this was not done on purpose by evil file creator who enjoys my headache, but rather is a result of some erroneous operations performed on a correct UTF-8 file.
The question is: what "reasonable" operations could be the cause? I have no idea how the file was created; probably it was exported by some unknown software, then could have been compressed, uploaded, copied & pasted, converted to another encoding, etc.
I'll be grateful for any idea : )
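For what it's worth, one operation that reproduces the damage exactly is a transport path that strips the eighth (most significant) bit of every byte, as some 7-bit mail gateways and terminal channels used to do. A quick, illustrative Python check:

corrupted = bytes(b & 0x7F for b in 'ę'.encode('utf-8'))
print(corrupted)    # b'D\x19' -> octal 104 031, exactly the bytes observed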
For a security test, I need to pass a file that contains null characters in its content and its filename.
For the body content, it's easy to use printf:
$ printf "Hello\00, Null!" > containsnull.txt
$ xxd containsnull.txt
0000000: 4865 6c6c 6f00 2c20 4e75 6c6c 21 Hello., Null!
But how can I create a file with the null bytes in the name?
Note: A solution in bash, python or nodejs is preferred if possible
It's impossible to create a file name containing a null byte through POSIX or Windows APIs. On all the Unix systems that I'm aware of, it's impossible to create a file name containing a null byte, even with an ill-behaved application that bypasses the normal API, because the kernel itself treats all of its file name inputs as null-terminated strings. I believe this is true on Windows as well but I'm not completely sure.
As an application programmer, in terms of security, this means that you don't need to worry about a file name containing null bytes, if you're sure that what you have is the name of a file. On the other hand, if you're given a string and told to use it as a file name, for example if you're programming a server and letting the client choose file names, you need to ensure that this string does not contain null bytes. That's just one requirement among others including the string length, the presence of a directory separator (/ or \), reserved names (. and .., reserved Windows file names such as nul.txt or prn), etc. On most Unix systems, on their native filesystem, the constraints for a file name are: no null byte or slash, length between 1 and some maximum, and the two names . and .. are reserved. Windows, and non-native filesystems on Unix, have additional constraints (it's possible to put a / in a file name through direct kernel calls on Windows).
To put a null byte in a file's contents, just write a string to a file using any language that allows null bytes in strings. In bash, you cannot store a null byte in a string, so you need to use another method such as printf '\0' or echo "abc" | tr b '\0'.
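In Python (a sketch; the file names are illustrative), bytes objects happily contain NUL, while a NUL in a file name is rejected before it ever reaches the kernel:

with open('containsnull.txt', 'wb') as f:
    f.write(b'Hello\x00, Null!')        # NUL in the content: no problem

try:
    open('bad\x00name.txt', 'w')        # NUL in the name: rejected
except ValueError as e:
    print(e)                            # "embedded null byte" (CPython)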
You don't have to worry about file names containing null bytes on Unix or Windows, because they simply cannot occur.
However, a file name that is being treated as UTF-8 can smuggle in the NUL character (U+0000) using invalid "overlong" sequences: two-, three- or four-byte UTF-8 sequences that have all zeros in their code-point payload bits.
This can be a security issue. For instance, a UTF-8 decoder that doesn't check for this can end up generating a wchar_t character value of 0, which then unexpectedly terminates the wide-character string.
For instance, the byte sequence C0 80 is an overlong encoding for NUL. This is evidently used by something called "Modified UTF-8" specifically for the purpose of encoding NUL characters that don't terminate the C string being used to hold the UTF-8.
If you're doing security testing, this is relevant; you can test whether programs are susceptible to NUL character (and other) injection via overlong encodings.
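As a quick illustration of the decoder behaviour you want (a Python sketch; the byte string is made up), a strict UTF-8 decoder must reject the overlong C0 80 sequence:

try:
    b'evil\xc0\x80name'.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)    # 'invalid start byte': a strict decoder refuses overlong forms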
Try $'\u000d'
Not actually a null byte, but probably close enough to confuse people: you have to look really closely to see that the last character is a D and not a 0, and the character itself will usually print (if not just as a blank) as the little box with hex codes in it.
Discovered this when I found a directory in my $HOME, named that...
I'm developing software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat such a file to inspect its contents, I'm thinking of having an ASCII-compatible "magic string" at the start of the file that gives the name and version of the binary format.
I'm thinking of having at least ten lines (\n) in the message so that head with its default settings doesn't reach the binary part.
Now, I wonder if there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text and should just be ignored? I tried 0x00 (the null byte) and 0x04 (Ctrl-D), but they seem to be simply ignored when catting the file.
cat regards a file as text; there is no way to trigger an end-of-file, since EOF is not actually a character.
The other way around works, of course: specify a format whose readers only start interpreting the content as binary from a certain marker onward.
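One way to apply that idea (a sketch; the magic string and file name are made up): write a fixed-size, newline-terminated ASCII preamble first, and have your own reader skip past it before interpreting binary data:

PREAMBLE = b'MYFORMAT v1 -- binary data follows; use the myformat tool\n' * 10

with open('data.myf', 'wb') as f:
    f.write(PREAMBLE)             # ten lines, so `head` shows only this
    f.write(bytes(range(256)))    # placeholder binary payload

with open('data.myf', 'rb') as f:
    f.seek(len(PREAMBLE))         # the reader skips the fixed-size preamble
    payload = f.read()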
I have created an ACH file which, in a text editor, looks exactly like a valid ACH file. When I open it in an ACH viewer tool I get an error saying that the first character must be 1. I found this in the NACHA file specs: 'Picture: This is the type of bit the ACH system is expecting to see. A 9 indicates a numeric value and an X indicates an alphabetic value. If you put a letter in a PIC 9 position, the system will reject the field. If you see a number in parentheses after the X or 9, that indicates the number of characters in that field. For example, 9(10) means that field contains 10 numeric characters.'
The first position in the file is supposed to have the content 1 in a Picture format of size 1. I don't understand what I need to do to fix this.
I finally downloaded a hex file explorer and saw that the valid ACH file and my file had different first characters. It turns out that the ACH file needs its data in ASCII: all I had to do was convert the data to ASCII before writing it to the file.
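In other words, make sure the bytes on disk are ASCII so that the first byte of the file really is 0x31 ('1'). A hedged Python sketch (the record content is a placeholder; NACHA records are 94 characters wide):

record = '1' + ' ' * 93                        # placeholder File Header record
with open('file.ach', 'wb') as f:
    f.write(record.encode('ascii') + b'\n')   # first byte on disk is 0x31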
I am looking for a way to check whether a PDF is missing its end-of-file marker. So far I have found that I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply read the whole file and check whether it ends with an EOF marker. I need to process lots of potentially large PDFs, and I want to load as little of each into memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a slightly more specific scenario than detecting general PDF "corruption". What is the best, fast way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer?(filename)
  # Scan only the last KiB; guard against files shorter than 1024 bytes.
  File.open(filename) { |f| f.seek(-[1024, f.size].min, :END); f.read.include?('%%EOF') }
end
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer?(filename)
  # scrub replaces any invalid UTF-8 bytes so the regex match cannot raise.
  pattern = /^startxref\n\d+\n%%EOF\n\z/m
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end