I'm looking for a character to use a filename delimiter (I'm storing multiple filenames in a plaintext string). Windows seems not to allow :, ?, *, <, >, ", |, / and \ in filenames. Obviously, \ and / can't be used, since they mean something within a path. Is there any reason why any of those others shouldn't be used? I'm just thinking that, similar to / or \, those other disallowed characters may have special meaning that I shouldn't assume won't be in path names. Of those other 7 characters, are any definitely safe or definitely unsafe to use for this purpose?
The characters : and " are also used in paths. Colon is the drive unit delimiter, and quotation marks are used when spaces are part of a folder or file name.
The charactes * and ? are used as wildcards when searching for files.
The characters < and > are used for redirecting an application's input and output to and from a file.
The character | is used for piping output from one application into input of another application.
I would choose the pipe character for separating file names. It's not used in paths, and its shape has a natural separation quality to it.
An alternative could be to use XML in the string. There is a bit of overhead and some characters need encoding, but the advantage is that it can handle any characters and the format is self explanatory and well defined.
Windows uses the semicolon as a filename delimiter: ;. look at the PATH environment variable, it is filled with ; between path elements.
(Also, in Python, the os.path.pathsep returns ";", while it expands to ":" on Unix)
I have used * in the past. The reason for portability to Linux/Unix. True, technically it can be used on those fileysystems too. In practice, all common OSes use it as a wildcard, thus it's quite uncommon in filenames. Also, people are not surprised if programs do break when you put a * in a filename.
Why dont you use any character with ALT key combination like ‡ (Alt + 0135) as delimiter ?
It is actually possible to create files programmatically with every possible character except \. (At least, this was true at one time and it's possible that Windows has changed its policy since.) Naturally, files containing certain characters will be harder to work with than others.
What were you using to determine which characters Windows allows?
Update: The set of characters allowed by Windows is also be determined by the underlying filesystem, and other factors. There is a blog entry on MSDN that explains this in more detail.
If all you need is the appearance of a colon, and will be creating it programatically, why not make use of a UTF-8 character that just looks like a colon?
My first choice would be the Modifier Letter (U+A789), as it is a typical RTL character and appears a lot like a colon. It is what I use when I need a full DateTime in the filename, such as file_2017-05-04_16꞉45꞉22_clientNo.jpg
I would stay away from characters like the Hebrew Punctuation Sof Pasuq (U+05C3), as it is a LTR character and may mess with how a system aligns the file name itself.
Related
Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" as printf("_%2x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.
How should I use 'sed' command to find and replace the given word/words/sentence without considering any of them as the special character?
In other words hot to treat find and replace parameters as the plain text.
In following example I want to replace 'sagar' with '+sagar' then I have to give following command
sed "s/sagar/\\+sagar#g"
I know that \ should be escaped with another \ ,but I can't do this manipulation.
As there are so many special characters and theie combinations.
I am going to take find and replace parameters as input from user screen.
I want to execute the sed from c# code.
Simply, I do not want regular expression of sed to use. I want my command text to be treated as plain text?
Is this possible?
If so how can I do it?
While there may be sed versions that have an option like --noregex_matching, most of them don't have that option. Because you're getting the search and replace input by prompting a user, you're best bet is to scan the user input strings for reg-exp special characters and escape them as appropriate.
Also, will your users expect for example, their all caps search input to correctly match and replace a lower or mixed case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
Note that there are far fewer regex characters used on the replacement side, with '&' meaning "complete string that was matched", and then the numbered replacment groups, like \1,\2,.... Given users that have no knowledge or expectation that they can use such characters, the likelyhood of them using is \1 in their required substitution is pretty low. More likely they may have a valid use for &, so you'll have to scan (at least) for that and replace with \&. In a basic sed, that's about it. (There may be others in the latest gnu seds, or some of the seds that have the genesis as PC tools).
For a replacement string, you shouldn't have to escape the + char at all. Probably yes for \. Again, you can scan your user's "naive" input, and add escape chars as need.
Finally if you're doing this for a "package" that will be distributed, and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around, some that have their roots in Unix/Linux, and others, particularly of super-sed, that (I'm pretty sure) got started as PC-standalones and has a very different feature set.
IHTH.
I am getting errors from customers who are uploading files with a colon in the file name, i.e. C:/uploads/test : doc.html
I assume that some Unix or Linux system is generating the file but I'm not sure how the users are saving them with the invalid filename. I have coded a piece that should rename the document on upload. My problem is that I can't test it because I can't get a file on Windows that has a colon in the filename.
I found a very similar character to a colon, "꞉" it is a unicode character called a Modifier Letter Colon. This has no space like the fullwidth colon and is pretty much exactly the same as a regular colon but the symbol works. You can either copy and paste it from above or you can use the code point, U+A789
A colon is an invalid character for a Windows file name. You won't be able to allow ':' in the file name, but you can work around it.
You can either do what it sounds like you have already done; create a script that replaces these invalid characters with valid ones on the UNIX side. Or, you can take care of this on the Windows server with File Name Character Translation: http://support.microsoft.com/kb/289627
Other replacements I've found for reserved characters are
” ‹ › ⁎ ∕ ⑊ \︖ ꞉ ⏐
For example in python you could do:
fixed_name = orig_name.replace('\\\\','⑊')
forbidden_characters = '"*/:<>?\|'
unicode_characters = '”⁎∕꞉‹›︖\⏐'
for a, b in zip(forbidden_characters, unicode_characters):
fixed_name = fixed_name.replace(a, b)
It's probable from the filename you provided that the character you have in your filenames is not a literal colon :, which is a reserved character in Windows filenames, but a fullwidth colon : Instead. It's a Unicode character that looks very much like a colon, visually surrounded by spaces that you can't remove. You can handle it the very same way as any Unicode chacter, the code point is U+FF1A.
You can use the CJK(Chinese/Japan/Korean) one
":"
which is pretty generic.
Currently, you would use WSL, url for instructions: https://learn.microsoft.com/en-us/windows/wsl/install-win10.
You could then create a colon in your Linux Distro.
HOW TO NAME A FILE OR FOLDER USING A SYMBOL THAT LOOKS LIKE A COLON
In the example below, the font size is 12, with the exception of the symbol, which is set to Subscript, Bold and a font size of 16. The character code for the colon-like symbol is 02F8.
The reason for the Subscript setting is to position the symbol lower relative to its vertical position. The bold and larger font settings are applied so that the symbol is more discernible on the page, and have no affect when used in a file or folder name.
Example: (C˸) Symbol – Subscript, Calibri, Bold and font size of 16.
*Using Windows 7, and Word 2007
My script downloads files from the net and then it saves them under the name taken from the same web server. I need a filter/remover of invalid characters for file/folder names under Windows NTFS.
I would be happy for multi platform filter too.
NOTE: something like htmlentities would be great....
Like Geo said, by using gsub you can easily convert all invalid characters to a valid character. For example:
file_names.map! do |f|
f.gsub(/[<invalid characters>]/, '_')
end
You need to replace <invalid characters> with all the possible characters that your file names might have in them that are not allowed on your file system. In the above code each invalid character is replaced with a _.
Wikipedia tells us that the following characters are not allowed on NTFS:
U+0000 (NUL)
/ (slash)
\ (backslash)
: (colon)
* (asterisk)
? (question mark)
" (quote)
< (less than)
(greater than)
| (pipe)
So your gsub call could be something like this:
file_names.map! { |f| f.gsub(/[\x00\/\\:\*\?\"<>\|]/, '_') }
which replaces all the invalid characters with an underscore.
filename_string.gsub(/[^\w\.]/, '_')
Explanation: Replace everything except word-characters (letter, number, underscore) and dots
I think your best bet would be gsub on the filename. One of the things I know you'll need to delete/replace is :.
I don't know how you plan to use those files later, but pretty much most reliable solution would be to keep the original filenames in a db table (or otherwise serialized hash), and name physical files after the unique ID that you (or the database) generated.
PS Another advantage of this approach is that you don't have to worry about the files with the same names (or different names that filter to same names).
For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor