More specifically, what is the authoritative source for that information?
This may look like a non-programming question, but I need to know whether a registry path fed to my code contains a regular expression or not. I decided the best way to do that is to assume that any occurrence of an invalid character (like '*') means a wildcard search.
For allowable key and value names, see the MSDN page on Structure of the Registry. In particular:
Each key has a name consisting of one or more printable characters.
Key names are not case sensitive. Key names cannot include the
backslash character (\), but any other printable character can be
used. Value names and data can include the backslash character.
Registry value types are explained in detail on MSDN here, in case you need to know the allowable values.
For all things Windows, MSDN has to be the authoritative source -- the article on Registry Element Size Limits implies Unicode is good and Structure of the Registry says that backslash and non-printable characters are disallowed in key names. Values merely have to be entirely printable characters.
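As a rough illustration of the documented rule, here is a minimal Python sketch. The function name is hypothetical, and Python's str.isprintable() is only an assumed stand-in for whatever MSDN means by "printable". Note that under this rule '*' is an ordinary printable character, so it passes the check rather than marking a wildcard.

```python
def is_documented_key_name(name: str) -> bool:
    """Check one key name (not a full path) against the quoted MSDN rule.

    Assumption: str.isprintable() approximates MSDN's "printable character".
    """
    if not name:
        # "one or more printable characters"
        return False
    if "\\" in name:
        # the backslash is the path separator and cannot appear in a key name
        return False
    return all(ch.isprintable() for ch in name)


if __name__ == "__main__":
    for candidate in ["Software", "*", "a\\b", "\x01Test"]:
        print(repr(candidate), is_documented_key_name(candidate))
```

A full path would first be split on the backslash and each component checked separately.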
Just did an experiment with the Windows 7 registry: programmatically creating a key name with the 01 Hex (ASCII SOH) character in front of the word 'TEST' (in Delphi that is the string: #1'Test'). This is something that REGEDIT will not allow you to do by typing - even with ALT-Keypad operations.
Not only did it create the key, it showed the key in REGEDIT as having a 'wide' space where the #1 character resided.
Copying and pasting this new subkey name into TEXTPAD allowed me to verify that it was indeed still a #1 character.
I've never read anywhere that #1 is deemed 'printable', but in Windows anything other than 00 Hex can be put into a print string and literally anything can be sent to a printer, so I guess the MSDN statement about this limitation is an oxymoron: because in Windows being a character implies being printable, ergo unprintable character becomes ...well, meaningless.
Whilst you cannot type that #1 character directly into REGEDIT as a key name (using the ALT-keypad-number entry method), you can nonetheless paste it back from TEXTPAD to REGEDIT as part of a rename operation. REGEDIT will even complain if you paste it to rename another peer subkey to your original one because the 'specified key already exists'.
Interestingly, I also experimented with the character #256 (which is no longer ASCII but is theoretically a Unicode widechar, though not necessarily one deemed "printable" if any part of the input, storage or output mechanisms rejects it).
Whilst I could create such a key programmatically, and see a strange looking 'A' in REGEDIT, it became somewhat less reliable in cut-and-paste. I'm guessing that the clipboard operations and interactions with different applications make this sort of thing a very dubious practice since TEXTPAD, for instance, might be making assumptions about whether you were pasting byte characters or wide characters that don't quite match what REGEDIT put into the clipboard, and vice versa. If the code behind these operations is just expecting ANSI strings or UTF-16 wide strings, and is given something different, including byte-order differences and UTF-8 or similar differences that it was not expecting, then things are very likely to go wrong.
Finally, I experimented with an attempt to inject a widechar with order 0FFFF hex. That did not actually give any visual presence of the character in REGEDIT - how "unprintable" is that, then? But the name did include the invisible character. I confirmed this by actually trying to create a separate peer subkey in REGEDIT without the offending character and as a result obtained what visually looked like two identical keys!
So in summary: It seems that you can put literally any character into a subkey name as long as it isn't a '\'. But it probably is not a very good idea to do so. And I think the term 'unprintable' in Windows generally only applies to 00 hex - and that is because it is usually used as a string terminator and therefore is a little bit difficult to 'send' through the registry API as a character!
What is quite worrying is the ability that this gives hackers to confuse and mislead. You could quite literally create a whole raft of registry subkeys that appear to have no names at all and can only be meaningfully used by applications, not humans. Yes, you could do that with space-characters, but some unicode characters (like FFFFh) have no width, and you can use any number of them together to create a unique and invisible name, or parts in a name! This makes them almost impossible to detect without using a laborious cut-and-paste, or a dedicated automated tool. In REGEDIT they all just look like identically named, or indeed unnamed, keys.
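A dedicated tool of the kind mentioned above could start from something like this Python sketch. It only makes suspicious characters in a name visible; the function name and the <U+XXXX> notation are my own choices, and reading the key names out of the registry is left out entirely.

```python
def reveal(name: str) -> str:
    """Return name with every non-printable or whitespace character
    replaced by a visible <U+XXXX> marker."""
    parts = []
    for ch in name:
        if ch.isprintable() and not ch.isspace():
            parts.append(ch)
        else:
            # control characters, noncharacters such as U+FFFF, and spaces
            parts.append(f"<U+{ord(ch):04X}>")
    return "".join(parts)


if __name__ == "__main__":
    for key_name in ["Test", "\x01Test", "Test\uffff"]:
        print(reveal(key_name))
```

Two names that look identical in REGEDIT would then print differently whenever one of them carries an invisible character.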
[Edit/Disclaimer]: Comments pointed out that I have to clarify the encoding the user uses. Will update accordingly
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me the filename, as neither zipping the file nor transmitting it through other media seems to preserve the filename.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not UTF8?
Unicode strings are encoded as a series of bytes. An encoding is the set of rules an operating system uses to turn those bytes back into the characters you see.
Given that Windows uses (a variation of) Unicode, and you say you have a character that's not in Unicode, it also means that there is simply no way to represent that character.
Imagine if unicode only contained the numbers 0-9, and you ask someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.
You could make up a new unicode codepoint for your character, but then operating systems won't know what to do with that unless you also make your own font files.
I somehow doubt that that's what you want to do though, but it's an option. Could your customer rename the file before sending it to you?
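If renaming is not practical, the customer could instead report the file name as code points, which survive email and chat unchanged. The following Python sketch is one way to do that; the function names are hypothetical. It also checks whether a name can be encoded as strict UTF-8. An unpaired surrogate such as "\ud800" is one string Python refuses to encode that way, but whether the customer's file name actually contains one is an assumption, not something established here.

```python
import os


def code_points(name: str) -> list[str]:
    """List the code points of a file name, e.g. ['U+0061', 'U+4E2D']."""
    return [f"U+{ord(ch):04X}" for ch in name]


def utf8_encodable(name: str) -> bool:
    """True if the name can be encoded as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def report(directory: str = ".") -> None:
    """Print an ASCII-safe description of every name in a directory."""
    for name in os.listdir(directory):
        print(ascii(name), utf8_encodable(name), " ".join(code_points(name)))


if __name__ == "__main__":
    report()
```

The printed lines contain only ASCII, so they can be pasted into a bug report without being altered in transit.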
I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII unit separator, character 31 (1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However, reading about its history, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this special character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 characters of ASCII are fully set in stone in the Unicode standard, including characters 0 to 31. The only reason you don't see special ASCII chars in use in strings very often is simply because of human interfacing limitations: they do not visualize well (if at all) when displayed to screen or written to file, and you can't easily type them in from a keyboard either. They're also not allowed in un-escaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
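To make the idea concrete, here is a minimal sketch in Python rather than the asker's C++. The helper names are hypothetical. It rejects values that already contain the separator, since that is the one case where the scheme would silently break.

```python
UNIT_SEPARATOR = "\x1f"  # ASCII 31


def build_key(values: list[str]) -> str:
    """Join values into one key string, delimited by the unit separator."""
    for value in values:
        if UNIT_SEPARATOR in value:
            raise ValueError("value already contains the unit separator")
    return UNIT_SEPARATOR.join(values)


def split_key(key: str) -> list[str]:
    """Recover the values from a key built by build_key."""
    return key.split(UNIT_SEPARATOR)


if __name__ == "__main__":
    key = build_key(["alpha", "b c", "d,e"])
    print(repr(key))
    print(split_key(key))
```

Because U+001F is the same code point in ASCII and Unicode, the same approach carries over to Unicode strings in Java or C#.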
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So, consider the ramifications if you think your hash key could end up on a barcode.
I know that this is a little vague, so for context, think of it as "a character you could tweet," or something like that. My question is how many valid unicode characters are there that a browser or a service that supports utf8 could resolve, in such a way that a utf8 browser could copy and paste it around without any issues.
I guess what I don't want is the full character space, because I know a lot of it is reserved for command characters or reserved characters that wouldn't be shown (unless I'm super wrong!).
UTF-8 isn't the important factor, since all of the standard Unicode encodings (UTF-8, UTF-16, UTF-32) encode the same character space, just in different ways.
From your explanation I see you don't just want the 1,112,064 valid Unicode code points?
Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, but a handful of those are what you're calling "control characters". Which ones do or don't fall into that category depends on how you're counting. Copying and pasting may result in some characters being treated as identical to one another, or ignored altogether, depending on the OS and the programs doing the copying and pasting.
However because Unicode is forward compatible, some systems will correctly preserve characters which haven't yet been assigned. After all, just because you're running Windows XP and you copy and paste a document with characters that weren't standardised until 2009 doesn't mean you expect them to vanish. There could be a million or so extra possible characters by this way of thinking, although their visual appearance may be indistinguishable in some places.
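The 1,112,064 figure is the full code space minus the surrogate range, which a short Python sketch can confirm. The second function is one possible way of "counting" the remaining characters: it excludes the control, format, surrogate, private-use and unassigned categories. That choice of categories is my assumption, and the result depends on the Unicode version bundled with your Python, so no figure is given for it.

```python
import unicodedata

CODE_SPACE = 0x110000                 # U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1      # 2,048 code points reserved for UTF-16

# Assumption: these general categories are the ones "that wouldn't be shown".
EXCLUDED = {"Cc", "Cf", "Cs", "Co", "Cn"}


def valid_code_point_count() -> int:
    """Code points that can be encoded in UTF-8, UTF-16 and UTF-32."""
    return CODE_SPACE - SURROGATES


def assigned_visible_count() -> int:
    """Assigned characters outside the EXCLUDED categories."""
    return sum(
        1
        for cp in range(CODE_SPACE)
        if unicodedata.category(chr(cp)) not in EXCLUDED
    )


if __name__ == "__main__":
    print(valid_code_point_count())
    print(unicodedata.unidata_version, assigned_visible_count())
```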
Right single quotation mark (U+2019)
vs.
Apostrophe (U+0027)
What is the difference between these two characters?
I ran into this issue where I use CAtlString to load a string from a resource file, and on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations. The U+2019 character appears in strings in my resource file that I copied from Word, and U+0027 appears in strings that I hand coded. Why does LoadString (sometimes) choke on this?
What is the difference between these two characters?
Arguable!
Going by the names, one would imagine that the curly ‹’› is only for use as a quotation mark, and that the straight ‹'› is only for use as a real apostrophe, an indicator of omitted letters.
However traditional typesetting practice in English is always to use a curly ‹’› to render an apostrophe. Personally—and I may be alone here—I don't like this. It can make for more ambiguous reading:
“He said, ‘It’s fish ’n’ chips’...”
with the apostrophes being straight it's (marginally) clearer where the quotation ends:
“He said, ‘It's fish 'n' chips’...”
and the apostrophe being ‘straight’ makes more sense to me because its purpose of indicating omitted letters has no inherent directionality, whereas quotation marks are clearly asymmetrical in purpose.
In traditional ASCII, of course, there are no smart quotes, so the apostrophe is always used for both...
on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations.
Here you are meeting the horror of the ‘ANSI’ code page. This is a default character encoding that is different across different Windows install locales. So on a machine in the Western region, you get different results when you read a resource to when you read it on a Japanese Windows.
It is highly unfortunate that Windows has varying default code pages instead of using a single global encoding like UTF-8, but it's too late to fix now. If you compile your whole application as a Unicode app (so you'll be using LoadStringW rather than LoadStringA) then you can cope with non-ASCII characters like the smart quotes much better.
If you can't move to a Unicode application you're a bit stuck. You won't be able to handle non-ASCII characters like the smart quotes globally, so stick with ASCII characters like the straight apostrophe ‹'› alone.
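The dependence on the code page can be illustrated outside Windows with Python's codecs. This is only an analogy: the codec names stand in for ANSI code pages, and the sketch does not call LoadString. The helper name is hypothetical.

```python
def survives(text: str, codec: str) -> bool:
    """True if text can be encoded with codec and decoded back unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        return False


if __name__ == "__main__":
    smart = "\u2019"   # right single quotation mark
    plain = "'"        # apostrophe, U+0027
    for codec in ["cp1252", "ascii", "utf-8"]:
        print(codec, survives(smart, codec), survives(plain, codec))
```

The straight apostrophe survives every codec here, while U+2019 is a single byte (0x92) in cp1252 and cannot be represented in plain ASCII at all.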
The U+2019 character appears in strings in my resource file that I copied from Word
Yes, Word has an annoying AutoCorrect feature that replaces all apostrophes you type with smart quotes. This is especially undesirable when you are dealing with code, where ‹’› will break the program; but it's also wrong even for plain old English, as it's not possible to correctly guess the desired direction of the quote. (It'll get one of the apostrophes in “fish 'n' chips” the wrong way round, for example.)
I suggest turning off the automatic-replace-with-smart-quotes feature. If you want the smart quotes, it's better to type them deliberately. Unfortunately they are inconvenient to type on most keyboard layouts, often requiring obscure Alt+numpad sequences. Personally I use this one to drop them onto Alt+[] keys.
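For strings that have already been pasted from Word, a clean-up pass is another option. Here is a minimal Python sketch, assuming you want every curly single and double quote replaced by its straight ASCII counterpart; the function name is hypothetical.

```python
# Map the four common "smart" quotes to their straight ASCII equivalents.
STRAIGHT_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten(text: str) -> str:
    """Replace curly quotes with straight ones; other text is untouched."""
    return text.translate(STRAIGHT_QUOTES)


if __name__ == "__main__":
    print(straighten("It\u2019s fish \u2019n\u2019 chips"))
```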
Historically, single-quote and double-quote come in pairs, left (open) and right (close).
For many years the character sets of computers were limited, having a single form of each.
Now, with the advent of Unicode, the full forms are available, but support for them is still limited. Programming languages still use the simple forms, and the full forms can still cause problems.
I have been looking for a way of modifying static strings stored in Windows .exe files in the .rdata section, however I haven't found a real way to do so yet.
The whole thing is too complicated to do by hand (in this case by a HEX editor) and so I wanted to know if you have a solution to do so.
What is complicated about doing it in a hex editor? One 'gotcha' that might be tripping you up is that you have to maintain each string's original length. You can do so with spaces at the end or (sometimes) by null-terminating it early, depending on how it's accessed in the executable.
If you really want to get tricky, you can try finding every cross reference to said string in the code and modify the length parameter passed to functions that use it.
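The same-length replacement described above is easy to script. This Python sketch works on a bytes buffer and pads the new string with NUL bytes, which corresponds to the "null-terminating it early" option and so only helps if the executable reads the string as NUL-terminated. The function name is hypothetical, it patches only the first occurrence, and it knows nothing about the PE format, checksums or signatures.

```python
def patch_string(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace the first occurrence of old with new, keeping the length.

    new must not be longer than old; the remainder is filled with NUL bytes.
    """
    if len(new) > len(old):
        raise ValueError("replacement must not be longer than the original")
    offset = data.find(old)
    if offset < 0:
        raise ValueError("original string not found")
    padded = new + b"\x00" * (len(old) - len(new))
    return data[:offset] + padded + data[offset + len(old):]


if __name__ == "__main__":
    image = b"\x90\x90Hello World\x00\x90\x90"
    patched = patch_string(image, b"Hello World", b"Hi")
    print(len(image) == len(patched), patched)
```

For wide (UTF-16) strings, old and new would need to be encoded with "utf-16-le" before patching.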