I was wondering, what is the longest possible name length allowed by the Windows kernel?
E.g.: I know the kernel uses UNICODE_STRING structures to hold all object paths, and since the byte length of a wide-character string is stored inside a USHORT, that allows for a maximum path length of 2^15 - 1 characters. Is there a similar, hard restriction on a file name (rather than path)? (I don't care if NTFS or FAT32 imposes a particular restriction; I'm looking for the longest possible theoretically allowed name in the kernel, assuming no additional file system or shell restrictions.)
(Edit: For those wondering why this even matters, consider that normally, traversing a directory is achieved by FindFirstFile/FindNextFile calls, one call per file. Given the function named NtQueryDirectoryFile, which is the underlying system call and which returns multiple file names per call, it's actually possible to take advantage of this maximum-length restriction on the path to make an extremely-fast directory traverser that uses solely the stack as a buffer. Now I'm trying to extend that concept, and I need to know the maximum size of a file name.)
The maximum length of a path is 32,767 characters whereby each path component (directory or file) can have a maximum length of 255 characters (to be more exact, the value returned in the lpMaximumComponentLength parameter of the GetVolumeInformation function).
This is documented on MSDN.
Ah, I found this page myself that guarantees that file names can't be longer than 255 characters:
A pathname MUST be no more than 32,760 characters in length.
...
Each pathname component MUST be no more than 255 characters in length.
Which makes me wonder:
Why does Windows use ULONGs for file name lengths, when it uses USHORTs for path lengths?!
If anyone knows why this is, please post/comment! I'm rather curious. :)
Related
I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.
The easiest way is to convert the hash to a hex string. The file name will then be 32 bytes * 2 = 64 characters long. This is a pretty long name to work with. How can I make it shorter?
I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters. I use only five bits of every byte to build file name with 32 characters. Much better.
I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?
Is there any recommended method to build short unique filenames at all?
If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M^2 / 2^(N+1).
For 500K files, that's about 1 / 2^(N-37).
Using base32, 16 chars gives probability of collision 1 / 2^43 -- a few trillion to 1 odds.
If that won't do, then 24 chars gives 1 / 2^83.
If you're willing to check them all and re-generate on collision, then 8 chars is fine.
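To make the arithmetic above easy to redo for other sizes, here is a minimal sketch of the estimate M^2 / 2^(N+1). The function names are mine, not from any library, and the approximation is only meaningful while the result is well below 1.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision: M^2 / 2^(N+1)."""
    return num_files ** 2 / 2 ** (hash_bits + 1)


def base32_bits(num_chars: int) -> int:
    """Each base32 character carries 5 bits of the hash."""
    return 5 * num_chars


if __name__ == "__main__":
    for chars in (8, 16, 24):
        p = collision_probability(500_000, base32_bits(chars))
        # Print the estimate as a power of two so it can be compared with the text.
        print(f"{chars} chars: about 2^{math.log2(p):.1f}")
```

With 8 characters the estimate is no longer small, which is why that length only makes sense if you check for collisions and re-generate.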
The number of collisions depends on the content of the files, the hash algorithm, and the length of the hash.
In general: The longer the hash-value is the less likely are collisions (if your content does not especially provoke collisions).
You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).
To shorten the filenames you could allow more distinct characters in the file name. (But be aware of which characters your OS allows and which you are willing to use.)
I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper and lower case character.
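A minimal sketch of that idea, assuming SHA-256 as the hash and the standard library's RFC 4648 base32 alphabet (A-Z, 2-7) rather than a custom vocabulary. The function name and the default length of 16 are my own choices for illustration.

```python
import base64
import hashlib


def content_name(data: bytes, length: int = 16) -> str:
    """Derive a short, case-insensitive-safe file name from file content."""
    digest = hashlib.sha256(data).digest()
    # 32 bytes encode to 52 base32 characters plus '=' padding; drop the padding.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    # Truncating keeps the leading 5 * length bits of the hash.
    return encoded[:length]


if __name__ == "__main__":
    print(content_name(b"hello world"))
    print(content_name(b"hello world", length=24))
```

Identical content always yields the same name, and lowercasing the output avoids surprises on case-insensitive filesystems.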
So we all know that Windows programs by default are limited to dealing with a maximum path length of 260 characters. However, this limit can easily be overcome by prefixing the path with the \\?\ character sequence.
For some reason, however, this isn't possible with relative paths, as MSDN says:
Because you cannot use the \\?\ prefix with a relative path,
relative paths are always limited to a total of MAX_PATH characters.
(source)
I don't really understand the reason why Microsoft decided to forbid relative paths to be prefixed with \\?\ so if there is some sort of rationale behind this decision, I'd be really glad to hear about it because it doesn't really make sense to me that \\?\ is only allowed for full paths.
My real question, though, is how to deal with this limitation: Should I simply call GetFullPathName() on relative paths to extend them to full paths, then add the \\?\ prefix, and then pass that path to fopen() etc., or what is the recommended way of dealing with this limitation?
You cannot use the \\?\ prefix with a relative path.
A relative path has to be resolved into an absolute path before it is passed on to the system. And as mentioned in the source:
The \\?\ prefixes are not used as part of the path itself. They
indicate that the path should be passed to the system with minimal
modification, which means that you cannot use forward slashes to
represent path separators, or a period to represent the current
directory, or double dots to represent the parent directory.
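The approach from the question (resolve to a full path first, then add the prefix) can be sketched as follows. This is only an illustration: `to_extended_path` is a hypothetical helper, it uses Python's `ntpath` module for the string handling so it can be tried on any platform, and it expects the caller to have resolved relative paths already (on Windows, `os.path.abspath` does that via GetFullPathName). Because the prefixed path is passed on with minimal modification, the helper normalizes slashes and dot segments before adding the prefix.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"  # the four characters \\?\


def to_extended_path(abs_path: str) -> str:
    """Add the \\\\?\\ prefix to an absolute Windows path."""
    if abs_path.startswith(EXTENDED_PREFIX):
        return abs_path  # already prefixed, leave untouched
    if not ntpath.isabs(abs_path):
        raise ValueError("resolve relative paths to absolute ones first")
    # Collapse '.', '..' and forward slashes, since the system will not.
    normalized = ntpath.normpath(abs_path)
    if normalized.startswith("\\\\"):
        # UNC paths use the \\?\UNC\server\share form.
        return EXTENDED_PREFIX + "UNC\\" + normalized[2:]
    return EXTENDED_PREFIX + normalized


if __name__ == "__main__":
    print(to_extended_path("C:\\temp\\..\\data/file.txt"))
```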
In some Windows APIs, for example Module32Next, Module32First, Process32Next, Thread32Next, etc., programmers are forced to set the dwSize field of a structure to the size of that structure. Why does Windows make us do that? Aren't these structures defined by Windows itself? Isn't the size a known constant?
PS: I looked into these functions and found that they just check if the size is equal to a hard coded constant.
By requiring the programmer to specify the size of the structure, Windows can tell which version of the structure the programmer is using. Some such structures have actually changed between different versions of Windows, and some haven't - but providing the size means that Microsoft have the option of changing it if they need to, without breaking existing applications.
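As an illustration of the calling convention, here is a sketch in Python's ctypes of filling in dwSize before handing a structure to such an API. The field layout follows the documented THREADENTRY32 structure, but I use fixed-width ctypes integers instead of the Windows type names so the snippet runs anywhere; `new_thread_entry` is a hypothetical helper, and no actual Windows call is made.

```python
import ctypes


class THREADENTRY32(ctypes.Structure):
    """Layout of THREADENTRY32 (DWORD = 32-bit unsigned, LONG = 32-bit signed)."""
    _fields_ = [
        ("dwSize", ctypes.c_uint32),
        ("cntUsage", ctypes.c_uint32),
        ("th32ThreadID", ctypes.c_uint32),
        ("th32OwnerProcessID", ctypes.c_uint32),
        ("tpBasePri", ctypes.c_int32),
        ("tpDeltaPri", ctypes.c_int32),
        ("dwFlags", ctypes.c_uint32),
    ]


def new_thread_entry() -> THREADENTRY32:
    """Create an entry whose dwSize states which layout the caller compiled against."""
    entry = THREADENTRY32()
    entry.dwSize = ctypes.sizeof(THREADENTRY32)
    return entry


if __name__ == "__main__":
    print(new_thread_entry().dwSize)
```

If a later version of the structure grew extra fields, code compiled against this layout would still pass the old size, and the callee could tell the two apart.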
Official info about Thread32First function says:
Thread32First changes dwSize to the number of bytes written to the
structure. This will never be greater than the initial value of
dwSize, but it may be smaller. If the value is smaller, do not rely on
the values of any members whose offsets are greater than this value.
I understand that, by specifying a value in dwSize, we tell Windows we don't need the other "members whose offsets are greater than this value".
(edited) After some tests, I now believe the correct answer is Harry Johnston's.
We are working with a number of Unix-based filesystems, all of which share a similar set of restrictions: certain characters can't be used in the username fields. One of those restrictions is no "@", "_", or "." in the names. Being Unix, there are a number of other restrictions.
So the question is if there is a good known algorithm that can take an email address and turn that into a predictable unix filename. We would need to reverse this at some point to get the email.
I've considered doing things like "."->"DOT", "@"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by mapping the @xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it produces incompatible characters such as "/".
-Readable is preferable.
Seems like the best answer would be to replace the @xyz.com domain with a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address@server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly humanly-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
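A minimal sketch of this scheme, assuming that only ASCII letters and digits are safe on the target filesystem and that addresses contain only characters with code points below 256. The function names and the `SAFE` set are my own assumptions, so adjust them to whatever your filesystem actually allows.

```python
import string

# Assumption: only letters and digits pass through unescaped.
SAFE = set(string.ascii_letters + string.digits)


def encode(text: str, escape: str = "%") -> str:
    """Replace every unsafe character (and the escape itself) with escape + 2 hex digits."""
    out = []
    for ch in text:
        if ch in SAFE and ch != escape:
            out.append(ch)
        else:
            if ord(ch) > 0xFF:
                raise ValueError("only single-byte characters are supported")
            out.append(f"{escape}{ord(ch):02X}")
    return "".join(out)


def decode(text: str, escape: str = "%") -> str:
    """Reverse encode(): an escape is always followed by exactly two hex digits."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == escape:
            out.append(chr(int(text[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(encode("mail.address@server.com"))
    print(encode("mail.address@server.com", escape="a"))
```

Escaping the escape character itself is what keeps the mapping reversible when an ordinary letter is used in place of the percent sign.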
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '@', '.', '_' you would be good to go. (Is that list comprehensive? If not, you would need to make sure you know all of them, otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think that just testing what's valid would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or whatever you choose if not, say '!', then { '@'->'!40', '_'->'!5F', '.'->'!2E' }. (The spec, RFC 1738, http://www.rfc-editor.org/rfc/rfc1738.txt, defines the characters as US-ASCII, so you can just find a table, e.g. in Wikipedia's ASCII article, and look up the correct hex digits there.) Or, since you don't need the whole ASCII set, you could just do your own simple mapping with two characters per escaped character and have, say, '!a', '!u', '!p' for at, underscore, period.
If you have two special characters, say, '%' and '!', you could delimit text that represents the character, say, '%at!', '%us!', and '%pd!'. (This is pretty much HTML-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flops which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being one, underscore two, and at-sign three: 'mickey._sample_@fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations but this one is easy to code.
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Once we're getting to anything other than a simple replacement (and even that) it's already getting hard to type, if that's the goal. But if you really need to try to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking use '0' as your escape character, so now '0' becomes '00', '@' becomes '01', '.' becomes '02', and '_' becomes '03'. So now, 'mickey01._sample_@fake.out' would become 'mickey0010203sample0301fake02out'. Not beautiful, but it should work; since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape char and you should be fine.
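Here is a small sketch of the '0'-as-escape idea, using the mapping given above with '01' standing for the at sign. The function names are made up for illustration, and the decoder assumes its input was produced by the encoder.

```python
# Mapping from the text: '0' is the escape character and must itself be escaped.
ESCAPES = {"0": "00", "@": "01", ".": "02", "_": "03"}
# Reverse lookup keyed by the digit that follows the escape character.
UNESCAPES = {code[1]: ch for ch, code in ESCAPES.items()}


def escape_name(address: str) -> str:
    """Replace each special character with its two-digit '0x' code."""
    return "".join(ESCAPES.get(ch, ch) for ch in address)


def unescape_name(name: str) -> str:
    """Reverse escape_name(): every '0' introduces a two-character code."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == "0":
            out.append(UNESCAPES[name[i + 1]])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape_name("mickey01._sample_@fake.out"))
```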
That's all I can think of atm. :) Definitely, if there's no need for these usernames to be readable in the raw, a standard encoding is a good way to go; there's lots of nice debugged, heavily field-tested code out there for it and it solves your problem quite handily. :) [It seems that Base64 itself won't work, since it can produce slashes. Heck, OK, just use the 2-digit US-ASCII hex value for each character and you're done...]
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two-step encoding logic whereby the email is
- first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
- then this array of bytes is encoded using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding -which can only store roughly 6 bits (and probably a bit less) per character-, doesn't cause the encoded string to be too long.
Without getting overly sophisticated for either the compression or the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily halve the size, but the encoding, say Base32, would grow the binary form by a factor of 8/5.
Efforts in improving the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets), and this may help make the output more human-readable and also more broadly safe on various flavors of file systems. For example, whereas Base64 seems optimal space-wise, using only uppercase letters (base 26) may ensure portability of the underlying scheme to file systems where file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input key (email addresses here).
Ideas for compression:
LZ seems like a good choice, though one may consider priming its initial buffer with common patterns found in email addresses (for example ".com" or even "a.com", "b.com", etc.). This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall. To further squeeze out a few bytes, maybe LZH or other LZ variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the strings we have to compress (email addresses) are themselves very short and would not benefit from, say, a 512-byte buffer. (Shorter buffer sizes allow shorter codes for the citations.)
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equals (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three characters allowed by all "flavors" of the targeted file systems may be a stretch.
Nevertheless, Base64 and its ratio of 4 output characters per 3 payload bytes provide what is probably the barely achievable upper limit of storage efficiency [for an acceptable character set].
At the lower end of this efficiency is maybe an ASCII representation of the hexadecimal values of the bytes in the array. This format, with a doubling of the payload bytes, may be acceptable length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) in the input and the characters in the encoded string).
Base32, whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, respectively, is essentially a variation of Base64 with a ratio of 8 output characters per 5 payload bytes, and may be a very viable compromise.
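A rough sketch of the compress-then-encode pipeline, with several assumptions of mine: zlib's raw DEFLATE stands in for the LZ step, its preset-dictionary feature (`zdict`) plays the role of the primed buffer, the `PRIMER` contents are invented for illustration, and the standard library's Base32 uses the RFC 4648 alphabet (A-Z, 2-7) rather than the A-Z/0-5 variant described above.

```python
import base64
import zlib

# Hypothetical primer: substrings expected to be common in the addresses.
PRIMER = b".org.net.com@gmail.com"


def encode_address(address: str) -> str:
    """Compress with a primed raw-DEFLATE stream, then Base32-encode without padding."""
    comp = zlib.compressobj(level=9, wbits=-15, zdict=PRIMER)
    packed = comp.compress(address.encode("utf-8")) + comp.flush()
    return base64.b32encode(packed).decode("ascii").rstrip("=")


def decode_address(name: str) -> str:
    """Reverse encode_address(); the same primer must be supplied."""
    padded = name + "=" * (-len(name) % 8)
    packed = base64.b32decode(padded)
    decomp = zlib.decompressobj(wbits=-15, zdict=PRIMER)
    return (decomp.decompress(packed) + decomp.flush()).decode("utf-8")


if __name__ == "__main__":
    encoded = encode_address("mickey.sample@gmail.com")
    print(len("mickey.sample@gmail.com"), len(encoded), encoded)
```

Whether the result is actually shorter than the input depends heavily on the primer and on the addresses, so it is worth measuring on real data before committing to this design.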
I want to convert a Windows pathname to a unique integer.
Eg:
For the pathname C:\temp\a.out, if I add the ASCII values of all the characters, I get 1234. But some other path can also generate the same number. So, what is the best way to generate unique numbers for different pathnames?
Look into Hash functions. Make sure to consider the case-insensitive nature of most Windows filenames when performing the hash.
Most likely, the language you are using provides a library function (or collection of functions) which can take the hash of a string (or just data). SHA1 is popular and has a low collision rate.
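For example, a minimal sketch in Python, assuming SHA-1 from the standard library and a 64-bit result; `path_to_int` is a made-up name. The path is normalized and lowercased first so that spellings which Windows treats as the same file hash to the same number. Collisions remain possible, just unlikely.

```python
import hashlib
import ntpath


def path_to_int(path: str, bits: int = 64) -> int:
    """Hash a Windows path to an integer of the given width (at most 160 bits)."""
    # normpath unifies slashes and dot segments; normcase lowercases.
    canonical = ntpath.normcase(ntpath.normpath(path))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    # Keep only the leading `bits` bits of the 160-bit digest.
    return int.from_bytes(digest, "big") >> (160 - bits)


if __name__ == "__main__":
    print(path_to_int("C:\\temp\\a.out"))
```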
Here on Stackoverflow there are many questions pertaining to hash functions. To get you started, simply search for "hash function". This may be a useful SO question for your case: What is a performant string hashing function that results in a 32 bit integer with low collision rates?.
There are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.
Perfect hashing
Yes, you'll need to use some kind of hash function, simply because the domain of your input is greater than the range of your output. In other words, there are almost certainly more valid pathnames than there are numbers representable in your target language's data type.
So it will not be possible to completely avoid collisions. If this guarantee is essential to your application, you won't be able to do it by translation to integers.
How about something like this:
Use a hash of (String->n bits) for each directory level. Allotting 20 bits for each of 10 directory levels is clearly not going to scale, but maybe a telescoping allocation of bits would, under the assumption that the lowest directory level will be the most populated -
e.g. if you have (from root) /A/B/C/D/E/F,
output some sort of n-bit number where
bits n/2 - n hash F
bits n/4 - n/2 hash E
bits n/8 - n/4 hash D
etc. etc.
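One way to code up that telescoping layout, as a sketch only: I assume "/" as the separator, a 64-bit result, and SHA-256 truncated to the required width as the per-component hash. None of those choices come from the description above, and the function names are hypothetical.

```python
import hashlib


def component_hash(name: str, bits: int) -> int:
    """Hash one path component down to `bits` bits (leading bits of SHA-256)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def telescoping_hash(path: str, total_bits: int = 64) -> int:
    """Give the last component half the bits, its parent a quarter, and so on."""
    parts = [p for p in path.split("/") if p]
    result = 0
    used = 0
    width = total_bits // 2
    for name in reversed(parts):
        if width == 0:
            break  # no bits left for the levels closest to the root
        result = (result << width) | component_hash(name, width)
        used += width
        width //= 2
    # Left-align so the last component always occupies the top half.
    return result << (total_bits - used)


if __name__ == "__main__":
    print(hex(telescoping_hash("/A/B/C/D/E/F")))
```

Note that levels near the root get very few bits (or none), so paths that differ only high up in the tree will often collide.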
If this is on Unix, you could just grab its inode number. ls -i shows it on the command line. The stat() function allows you to retrieve it from a program.
Hard links would show up as the same file, since they share an inode. Soft links would also show up as the same file as their target when queried with stat(), which follows them; lstat() reports the link's own inode instead. This may or may not be the behavior you want.
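From Python, the same information is available through `os.stat`. A small sketch follows; `file_identity` is a hypothetical helper name. Since inode numbers are only unique within one filesystem, pairing the inode with the device number is the safer key.

```python
import os


def file_identity(path: str) -> tuple[int, int]:
    """Return (device, inode) for a path; os.stat follows symbolic links."""
    info = os.stat(path)
    return (info.st_dev, info.st_ino)


if __name__ == "__main__":
    print(file_identity(os.curdir))
```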
I see a lot of folks talking about hashing. That could work, but theoretically if your hash does anything more than compress out integer values that are not allowable in file names, then you could have clashes. If that is unacceptable for you, then your hash is always going to be nearly as many digits as the file name. At that point, you might as well just use the file name.
Jimmy said:
there are more possible pathnames than
integers, therefore you can't have
true uniqueness. You could settle for
something like an MD5 hash.
I don't think there are more possible path names than integers. As a construction to create a unique number from a pathname, we can convert each letter to a (two-digit) number (so from 10-25, 26=., then other special chars, and 27 being / -- this is assuming there are fewer than 89 different characters; otherwise, we can move to three-digit encoding)
home/nlucaroni/documents/cv.pdf
1724221427232130121027242318271324122827123136251315
This forms a bijection (although, if you count only valid path names then the surjective property fails, but normally one doesn't care about that holding) --Come up with a path that isn't an integer.
This number obviously doesn't fit in a 64-bit unsigned int (max being 18446744073709551615), so it's not practical, but this isn't the point of my response.
You can read here Best way to determine if two path reference to same file in C# how you can uniquely identify a path. You need three numbers (dwVolumeSerialNumber, nFileIndexHigh and nFileIndexLow), maybe you can combine those three numbers to a new number with three times more bits. See also here: What are your favorite extension methods for C#? (codeplex.com/extensionoverflow) .
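Combining the three numbers is just bit shifting, assuming each of them is a 32-bit value. The helper name and argument names below are hypothetical, and obtaining the values themselves (e.g. from GetFileInformationByHandle) is left out of this sketch.

```python
def combine_file_id(volume_serial: int, index_high: int, index_low: int) -> int:
    """Pack three 32-bit values into one 96-bit integer."""
    for value in (volume_serial, index_high, index_low):
        if not 0 <= value < 2 ** 32:
            raise ValueError("each component must fit in 32 bits")
    return (volume_serial << 64) | (index_high << 32) | index_low


if __name__ == "__main__":
    print(hex(combine_file_id(0x1234ABCD, 0x00000002, 0x0000BEEF)))
```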
To all the people saying "it's not possible because you have more possible paths than integers to store them in": no. The poster never specified an implementation language; some languages support arbitrary-length integers. Python, for example.
Say we take the 32,000 character paths as the limit mentioned in one of the other comments. If we have 256 different characters to use with paths we get:
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 32000L**256L
20815864389328798163850480654728171077230524494533409610638224700807216119346720596024478883464648369684843227908562015582767132496646929816279813211354641525848259018778440691546366699323167100945918841095379622423387354295096957733925002768876520583464697770622321657076833170056511209332449663781837603694136444406281042053396870977465916057756101739472373801429441421111406337458176000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
>>>
Notice how Python represents that just fine? Yes, there's probably a better way to do it, but that doesn't mean it's impossible.
EDIT: rjack pointed out that it's actually 256^32000, not the other way around. Python still handles it just fine. The performance may leave something to be desired, but saying it's mathematically impossible is wrong.
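To make this concrete, here is a sketch of a reversible mapping between a path and an arbitrary-length Python integer. The function names are mine; the leading marker byte is an assumption I add so that the conversion back knows exactly how many bytes to restore.

```python
def path_as_int(path: str) -> int:
    """Map a path to a unique integer by reading its bytes as one big number."""
    # The 0x01 marker byte keeps leading zero bytes from being lost.
    data = b"\x01" + path.encode("utf-8")
    return int.from_bytes(data, "big")


def int_as_path(number: int) -> str:
    """Reverse path_as_int()."""
    data = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return data[1:].decode("utf-8")


if __name__ == "__main__":
    n = path_as_int("C:\\temp\\a.out")
    print(n)
    print(int_as_path(n))
    # The corrected count from the edit above, 256^32000, needs this many bits:
    print((256 ** 32000).bit_length())
```

The integers are unique but huge, roughly 8 bits per character of the path, so this only helps if arbitrary-length integers are acceptable.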