What is the difference between path.Match and filepath.Match? - go

The documentation and the code for both seem the same.
Why are there two duplicate functions?
https://golang.org/pkg/path/#Match
https://golang.org/pkg/path/filepath/#Match

They are not "duplicates"; they belong to different packages, so you should examine and interpret them in the context of their packages.
Package path "implements utility routines for manipulating slash-separated paths" independent of the platform / operating system.
Package path/filepath "implements utility routines for manipulating filename paths in a way compatible with the target operating system-defined file paths".
So for example path/filepath handles the path separator differences between operating systems.
If you look closely at the doc of filepath.Match(), it ends with:
On Windows, escaping is disabled. Instead, '\' is treated as path separator.
And there are also term interpretation differences. path.Match():
term:
'*' matches any sequence of non-/ characters
'?' matches any single non-/ character
And filepath.Match():
term:
'*' matches any sequence of non-Separator characters
'?' matches any single non-Separator character

The one in the filepath package is operating system dependent, and the one in the path package always uses slash (/) as separator.
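A quick way to see the difference is to run the same pattern and name through both functions. The following is only a minimal sketch; matchBoth is a hypothetical helper name, not something either package provides.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"runtime"
)

// matchBoth (hypothetical helper) runs the same pattern and name through
// both packages. path.Match always treats '/' as the only separator;
// filepath.Match uses the separator of the target operating system.
func matchBoth(pattern, name string) (slashOnly, osSpecific bool, err error) {
	if slashOnly, err = path.Match(pattern, name); err != nil {
		return false, false, err
	}
	osSpecific, err = filepath.Match(pattern, name)
	return slashOnly, osSpecific, err
}

func main() {
	fmt.Printf("GOOS=%s, filepath.Separator=%q\n", runtime.GOOS, filepath.Separator)
	for _, name := range []string{"a/b", `a\b`} {
		p, f, err := matchBoth("a*b", name)
		if err != nil {
			fmt.Println("bad pattern:", err)
			return
		}
		fmt.Printf("%-4s path.Match=%-5v filepath.Match=%v\n", name, p, f)
	}
}
```

With path.Match, "a*b" does not match "a/b" (the * cannot cross a slash) but does match a\b on every platform. filepath.Match should agree with that on Linux and macOS, and give the opposite answers on Windows, where \ is the separator.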

Related

Why does my canonicalized path get prefixed with \\?\

I'm working on a personal project that I was trying to solve via canonicalizing a relative path in Rust. However, whenever I do so, the new path gets prefixed with a strange \\?\ sequence. For example, something as simple as:
let p = fs::canonicalize(".").unwrap();
println!("{}", p.display());
will result in something like the following output:
\\?\C:\Users\[...]\rustprojects\projectname
This isn't a particular problem because I can accomplish what I'm attempting in other ways. However, it seems like odd behavior, especially if you are going to use the string form of the path in some way that requires accuracy. Why is this sequence of characters prepended to the result, and how can I avoid it?
The \\?\ prefix tells Windows to treat the path as is, i.e. it disables the special meaning of . and .., special device names like CON are not interpreted and the path is assumed to be absolute. It also enables using paths up to 32,767 characters (UTF-16 code units), whereas otherwise the limit is 260 (unless you're on Windows 10, version 1607 or later, and your application opts in to longer paths).
Therefore, the \\?\ prefix ensures that you'll get a usable path; removing that prefix may yield a path that is unusable or that resolves to a different file! As such, I would recommend that you keep that prefix in your paths.

Is there any character that is illegal in file paths on every OS?

Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" as printf("_%02x", ch), so that every escape is exactly two hex digits. So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
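The scheme above can be sketched in Go roughly as follows. The function names encodeName, joinNames and splitNames are made up for this illustration, and the sketch assumes names are handled byte by byte, with only [a-z0-9] passed through unescaped.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeName keeps bytes in [a-z0-9] and writes every other byte,
// including '_' itself, as '_' followed by two lowercase hex digits.
func encodeName(name string) string {
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		c := name[i]
		if (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') {
			b.WriteByte(c)
		} else {
			fmt.Fprintf(&b, "_%02x", c)
		}
	}
	return b.String()
}

// joinNames encodes each name and joins the results with "__".
func joinNames(names []string) string {
	enc := make([]string, len(names))
	for i, n := range names {
		enc[i] = encodeName(n)
	}
	return strings.Join(enc, "__")
}

// splitNames reverses joinNames by scanning left to right: "__" ends a
// name, and '_' followed by two hex digits decodes to one byte.
// Note that an empty input decodes to a list holding one empty name.
func splitNames(joined string) ([]string, error) {
	var names []string
	var cur []byte
	for i := 0; i < len(joined); {
		switch {
		case joined[i] != '_':
			cur = append(cur, joined[i])
			i++
		case i+1 < len(joined) && joined[i+1] == '_':
			names = append(names, string(cur))
			cur = cur[:0]
			i += 2
		case i+2 < len(joined):
			v, err := strconv.ParseUint(joined[i+1:i+3], 16, 8)
			if err != nil {
				return nil, fmt.Errorf("bad escape at offset %d: %w", i, err)
			}
			cur = append(cur, byte(v))
			i += 3
		default:
			return nil, fmt.Errorf("truncated escape at offset %d", i)
		}
	}
	return append(names, string(cur)), nil
}

func main() {
	joined := joinNames([]string{"hello.txt", "goodbye", ".."})
	fmt.Println(joined)
	names, err := splitNames(joined)
	fmt.Printf("%q %v\n", names, err)
}
```

Because an escape never starts with a second underscore, a left-to-right scan can always tell a separator from an escaped byte.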
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.

Does the "SubstituteName" string in the PathBuffer of a REPARSE_DATA_BUFFER structure always start with the prefix "\??\", and if so, why?

I am trying to use Windows API functions compatible with Windows XP and up to find the target of a junction or symbolic link. I am using CreateFile to get a handle to the reparse point, then DeviceIoControl with the FSCTL_GET_REPARSE_POINT flag to read the reparse data into a REPARSE_DATA_BUFFER. Then, I use the offsets and lengths in the buffer to extract the SubstituteName and PrintName strings.
In Windows 8, extracting the PrintName works perfectly, giving me a normal path (i.e. c:\filename.ext), but in XP the PrintName section of the REPARSE_DATA_BUFFER seems to always have a length of 0, leaving me with an empty string.
Using the SubstituteName seems to work in both, but I always end up with a prefix of \??\ on the beginning of the file path (i.e. \??\c:\filename.ext). (As a side note, fsutil reparsepoint query shows the \??\ prefix as well.)
I've read through much of the documentation on MSDN, but I can't find any explanation of this prefix. If the prefix is guaranteed to begin every SubstituteName, then I can just exclude the first four characters when I copy the file path from the buffer, but I'm not sure that this is the case. I would love to know if the "\??\" prefix appears in the SubstituteName for all Microsoft reparse points and why.
The Windows kernel has a "DOS Devices namespace" \DosDevices\ which is basically where anything you can open with CreateFile resides. (QueryDosDevice is a function which gives you all the members of that namespace.)
Because it's such a commonly used path, \??\ also redirects to that namespace. So, to the kernel, the path C:\Windows is invalid -- it should really be written as something like \??\C:\Windows. That's where this notation comes from.
The \??\ prefix means the path is not parsed. It is not guaranteed on every name, so you will have to look for the prefix on a per-name basis and skip it if present.
Update: I could not find any definitive documentation explaining exactly what \??\ actually represents, but here are some links that mention the \??\ prefix in action:
http://www.flexhex.com/docs/articles/hard-links.phtml
Note that szTarget string must contain the path prefixed with the "non-parsed" prefix "\??\", and terminated with the backslash character, for example "\??\C:\Some Dir\".
http://social.msdn.microsoft.com/Forums/en-US/vbgeneral/thread/908b3927-1ee9-4e03-9922-b4fd49fc51a6
http://mjunction.googlecode.com/svn-history/r5/trunk/MJunction/MJunction/JunctionPoint.cs
This prefix indicates to NTFS that the path is to be treated as a non-interpreted path in the virtual file system.
Private Const NonInterpretedPathPrefix As String = "\??\"
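The per-name check suggested in this answer (look for the prefix, skip it only if present) is easy to express in code. Here is a minimal sketch in Go, operating on a SubstituteName that has already been converted to a string; trimNTPrefix is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// ntPrefix is the prefix discussed above that may precede a SubstituteName.
const ntPrefix = `\??\`

// trimNTPrefix (hypothetical helper) returns name without a leading \??\
// and reports whether the prefix was present. Names that do not start
// with the prefix are returned unchanged.
func trimNTPrefix(name string) (string, bool) {
	if strings.HasPrefix(name, ntPrefix) {
		return name[len(ntPrefix):], true
	}
	return name, false
}

func main() {
	for _, n := range []string{`\??\c:\filename.ext`, `c:\filename.ext`} {
		trimmed, had := trimNTPrefix(n)
		fmt.Printf("%s -> %s (prefix present: %v)\n", n, trimmed, had)
	}
}
```

Checking first, instead of always dropping four characters, keeps names without the prefix intact.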

CGI DLL (built in Delphi) physical path

I deployed a CGI DLL built with Delphi 2007 on a Windows 2008 server. Internally I need to use the current DLL path.
Normally I can use GetModuleFileName or GetModuleName, but on the server they both return:
\\?\c:\my\correct\path
Why the first 4 characters? It looks like a network path? Is there any way to exclude those first 4 characters?
The pertinent documentation is this:
Maximum Path Length Limitation
In the Windows API (with some exceptions discussed in the following
paragraphs), the maximum length for a path is MAX_PATH, which is
defined as 260 characters. A local path is structured in the following
order: drive letter, colon, backslash, name components separated by
backslashes, and a terminating null character. For example, the
maximum path on drive D is "D:\some 256-character path string<NUL>"
where "<NUL>" represents the invisible terminating null character for
the current system codepage. (The characters < > are used here for
visual clarity and cannot be part of a valid path string.)
Note File I/O functions in the Windows API convert "/" to "\" as part
of converting the name to an NT-style name, except when using the
"\\?\" prefix as detailed in the following sections.
The Windows API has many functions that also have Unicode versions to
permit an extended-length path for a maximum total path length of
32,767 characters. This type of path is composed of components
separated by backslashes, each up to the value returned in the
lpMaximumComponentLength parameter of the GetVolumeInformation
function (this value is commonly 255 characters). To specify an
extended-length path, use the "\\?\" prefix. For example, "\\?\D:\very
long path".
Note The maximum path of 32,767 characters is approximate, because
the "\\?\" prefix may be expanded to a longer string by the system at
run time, and this expansion applies to the total length.
The "\\?\" prefix can also be used with paths constructed according to
the universal naming convention (UNC). To specify such a path using
UNC, use the "\\?\UNC\" prefix. For example, "\\?\UNC\server\share",
where "server" is the name of the computer and "share" is the name of
the shared folder. These prefixes are not used as part of the path
itself. They indicate that the path should be passed to the system
with minimal modification, which means that you cannot use forward
slashes to represent path separators, or a period to represent the
current directory, or double dots to represent the parent directory.
Because you cannot use the "\\?\" prefix with a relative path,
relative paths are always limited to a total of MAX_PATH characters.
As long as you are calling Unicode versions of Windows API functions, there's no need to strip the "\\?\" prefix, because the path that you have been handed is a valid path.
As we discovered in the comments, you were calling an ANSI version of an API function. And when you do that, the "\\?\" prefix is not valid. So, stick to Unicode API functions and it's all good!
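If you still need the ordinary form of the path, for example for display, the two prefixes described in the quoted documentation can be mapped back by plain string handling. This is a sketch only, written in Go; stripExtendedPrefix is a hypothetical helper name. The result is not guaranteed to be usable: it may exceed MAX_PATH, and components such as . or .. will be interpreted again once the prefix is gone.

```go
package main

import (
	"fmt"
	"strings"
)

// stripExtendedPrefix (hypothetical helper) maps
//   \\?\UNC\server\share -> \\server\share
//   \\?\C:\dir           -> C:\dir
// and returns any other path unchanged. It is pure string handling and
// does not check that the result is a valid or equivalent path.
func stripExtendedPrefix(p string) string {
	const unc = `\\?\UNC\`
	const ext = `\\?\`
	switch {
	case strings.HasPrefix(p, unc): // must be tested before the shorter prefix
		return `\\` + p[len(unc):]
	case strings.HasPrefix(p, ext):
		return p[len(ext):]
	}
	return p
}

func main() {
	fmt.Println(stripExtendedPrefix(`\\?\c:\my\correct\path`))
	fmt.Println(stripExtendedPrefix(`\\?\UNC\server\share`))
}
```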

Colon/Asterisk as a filename delimiter?

I'm looking for a character to use as a filename delimiter (I'm storing multiple filenames in a plaintext string). Windows seems not to allow :, ?, *, <, >, ", |, / and \ in filenames. Obviously, \ and / can't be used, since they mean something within a path. Is there any reason why any of those others shouldn't be used? I'm just thinking that, similar to / or \, those other disallowed characters may have special meaning that I shouldn't assume won't be in path names. Of those other 7 characters, are any definitely safe or definitely unsafe to use for this purpose?
The characters : and " are also used in paths. Colon is the drive unit delimiter, and quotation marks are used when spaces are part of a folder or file name.
The characters * and ? are used as wildcards when searching for files.
The characters < and > are used for redirecting an application's input and output to and from a file.
The character | is used for piping output from one application into input of another application.
I would choose the pipe character for separating file names. It's not used in paths, and its shape has a natural separation quality to it.
An alternative could be to use XML in the string. There is a bit of overhead and some characters need encoding, but the advantage is that it can handle any characters and the format is self explanatory and well defined.
Windows itself uses the semicolon (;) as a delimiter between paths: look at the PATH environment variable, which has ; between its path elements.
(Also, in Python, os.path.pathsep is ";" on Windows and ":" on Unix.)
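Go exposes the same platform list separator as os.PathListSeparator, and filepath.SplitList splits on it. A minimal sketch follows; joinList and splitList are hypothetical wrapper names. It only works if no name contains the separator itself, which is not guaranteed: ; is allowed in Windows filenames, and : is allowed on Unix.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// joinList (hypothetical helper) joins paths with the OS list separator:
// ';' on Windows, ':' on Unix-like systems.
func joinList(paths []string) string {
	return strings.Join(paths, string(os.PathListSeparator))
}

// splitList (hypothetical helper) splits a joined string the same way
// PATH-style environment variables are split on this platform.
func splitList(joined string) []string {
	return filepath.SplitList(joined)
}

func main() {
	joined := joinList([]string{"bin", "tools"})
	fmt.Printf("%q\n", joined)
	fmt.Printf("%q\n", splitList(joined))
}
```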
I have used * in the past. The reason was portability to Linux/Unix. True, technically it can be used in filenames on those filesystems too. In practice, all common OSes use it as a wildcard, so it's quite uncommon in filenames. Also, people are not surprised if programs break when you put a * in a filename.
Why don't you use a character entered with an ALT key combination, like ‡ (Alt + 0135), as the delimiter?
It is actually possible to create files programmatically with every possible character except \. (At least, this was true at one time and it's possible that Windows has changed its policy since.) Naturally, files containing certain characters will be harder to work with than others.
What were you using to determine which characters Windows allows?
Update: The set of characters allowed by Windows is also determined by the underlying filesystem, and other factors. There is a blog entry on MSDN that explains this in more detail.
If all you need is the appearance of a colon, and you will be creating it programmatically, why not use a Unicode character that just looks like a colon?
My first choice would be the Modifier Letter Colon (U+A789), as it is a typical LTR character and looks a lot like a colon. It is what I use when I need a full DateTime in the filename, such as file_2017-05-04_16꞉45꞉22_clientNo.jpg
I would stay away from characters like the Hebrew Punctuation Sof Pasuq (U+05C3), as it is an RTL character and may mess with how a system aligns the file name itself.
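Producing such a filename programmatically is a one-line substitution. Here is a minimal sketch in Go; colonSafe is a hypothetical helper name, and the prefix and suffix are taken from the example filename above.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// modifierColon is U+A789 MODIFIER LETTER COLON, which resembles ':'.
const modifierColon = "\uA789"

// colonSafe (hypothetical helper) replaces every ASCII colon in s with
// U+A789 so the result can be used in a Windows filename.
func colonSafe(s string) string {
	return strings.ReplaceAll(s, ":", modifierColon)
}

func main() {
	t := time.Date(2017, time.May, 4, 16, 45, 22, 0, time.UTC)
	stamp := colonSafe(t.Format("2006-01-02_15:04:05"))
	fmt.Println("file_" + stamp + "_clientNo.jpg")
}
```

Anything that later parses the name has to expect U+A789 in place of the ASCII colon, since the two only look alike.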
