I have a little program that prints out a directory structure.
It works fine except when the directory names contain German umlaut characters.
In that case it prints a blank line after the directory line.
I'm running Python 3.5 on Windows 7 64-bit.
This code ...
import os  # used by os.path.join below

class dm():
    ...
    def print(self, rootdir=None, depth=0):
        if rootdir is None:
            rootdir = self.initialdir
        if rootdir in self.dirtree:
            print('{}{} ({} files)'.format(' ' * depth,
                                           rootdir,
                                           len(self.dirtree[rootdir]['files'])))
            for _dir in self.dirtree[rootdir]['dirs']:
                self.print(os.path.join(rootdir, _dir), depth + 1)
        else:
            pass
...produces the following output:
B:\scratch (11 files)
 B:\scratch\Test1 (3 files)
 B:\scratch\Test1 - Kopie (0 files)
 B:\scratch\Test1 - Übel (0 files)

 B:\scratch\Test2 (3 files)
  B:\scratch\Test2\Test21 (0 files)
This happens with the codepage set to 65001. If I change the codepage to e.g. 850, the blank line disappears, but of course the "Ü" isn't printed correctly.
The structure self.dirtree is a dict of dicts of lists; it is built with os.walk and seems OK.
Python or Windows? Any suggestions?
Marvin
There are several bugs when using codepage 65001 (UTF-8) -- all of which are due to the Windows console (i.e. conhost.exe), not Python. The best solution is to avoid this buggy codepage, and instead use the wide-character API, such as by loading win_unicode_console.
You're experiencing a bug that exists in the legacy console that was used prior to Windows 10. (It's still available in Windows 10 if you select the option "Use legacy console".) The console decodes the UTF-8 buffer to UTF-16 and reports back that it writes b'\xc3\x9c' (i.e. "Ü" encoded as UTF-8) as one character, but it's supposed to report back the number of bytes that it writes, which is two. Python's buffered sys.stdout sees that apparently one byte wasn't written, so it dutifully writes the last byte of the line again, which is b'\n'. That's why you get an extra newline. The result can be far worse if a written buffer has many non-ASCII characters, especially codes above U+07FF that get encoded as three UTF-8 bytes.
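For reference, here is a minimal C sketch that makes the miscount visible on an affected legacy console (it assumes codepage 65001 is active; the expected values follow from the description above):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SetConsoleOutputCP(65001);                      // UTF-8 codepage
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    // "Ü\n" encoded as UTF-8 is 3 bytes; the buggy legacy console reports
    // the number of decoded characters (2) instead of the bytes written (3).
    WriteFile(out, "\xC3\x9C\n", 3, &written, NULL);
    printf("written = %lu (a correct console reports 3)\n", written);
    return 0;
}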
There's a worse bug if you try to paste "Ü" into the interactive REPL. This bug is still present even in Windows 10. In this case a process is reading the console's wide-character (UTF-16) input buffer encoded as UTF-8. The console does the conversion via WideCharToMultiByte with a buffer that assumes one Unicode character is a single byte in the target codepage. But that's completely wrong for UTF-8, in which one UTF-16 code may map to as many as three bytes. In this case it's two bytes, and the console only allocates one byte in the translation buffer. So WideCharToMultiByte fails, but does the console try to increase the translation buffer size? No. Does it fail the call? No. It actually returns back that it 'successfully' read 0 bytes. To Python's REPL that signals EOF (end of file), so the interpreter just exits as if you had entered Ctrl+Z at the prompt.
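The failing conversion is easy to reproduce in isolation. Here is a sketch of the kind of undersized-buffer call the console effectively makes (one byte of room for a two-byte UTF-8 sequence):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    wchar_t wc = L'\x00DC';  // "Ü" as a single UTF-16 code unit
    char out[1];             // one byte per character: too small for UTF-8
    int n = WideCharToMultiByte(CP_UTF8, 0, &wc, 1, out, 1, NULL, NULL);
    // n == 0 and GetLastError() == ERROR_INSUFFICIENT_BUFFER (122)
    printf("n = %d, err = %lu\n", n, GetLastError());
    return 0;
}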
Related
I was trying to implement a unified input interface using the Windows API function ReadFile for my application, which should be able to handle both console input and redirection. It didn't work as expected with console input containing multibyte (e.g. CJK) characters.
According to the Microsoft documentation, for console input handles ReadFile just behaves like ReadConsoleA. (FYI, the results are encoded in the console's current code page, so the A family of console functions is acceptable; and there is no ReadFileW, since ReadFile works on bytes.) The third and fourth arguments of ReadFile are nNumberOfBytesToRead and lpNumberOfBytesRead respectively, but in ReadConsole they are nNumberOfCharsToRead and lpNumberOfCharsRead. To find out the exact mechanism, I ran the following test:
HANDLE in = GetStdHandle(STD_INPUT_HANDLE); // console input handle
BYTE buf[8];
DWORD len;
BOOL f = ReadFile(in, buf, 4, &len, NULL);
if (f) {
    // Print buf, len
    ReadConsoleW(in, buf, 4, &len, NULL); // check count of remaining characters
    // Print len
}
For input like 字, len is set to 4 at first (two bytes for the character plus CR LF), indicating that the arguments count bytes.
For 文字 or a字, len stays 4 and only the first 4 bytes of buf are used at first, but the second read does not get the CRLF. Only when more than 3 characters are entered does the second read get the unread LF, then CR. This means ReadFile actually consumes up to 4 logical characters and discards whatever part of the input lies beyond the first 4 bytes.
The behavior of ReadConsoleA is identical to that of ReadFile.
Obviously, this is more likely a bug than a design decision. I did some searching and found a related piece of feedback dating back to 2009. It seems that ReadConsoleA and ReadFile used to read data fully from console input, but since that was inconsistent with the ReadFile specification and could cause severe buffer overflows threatening system processes, Microsoft made a makeshift repair: it simply discards the excess bytes, ignoring support for multibyte charsets. (There is a separate issue about the behavior after that fix, where the buffer is limited to 1 byte.)
Currently the only practical solution I have come up with to make input correct is to check whether the input handle is a console, and process it differently using ReadConsoleW if so, which adds complexity to the implementation. Are there other ways to get it correct?
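For reference, a sketch of that approach (ReadInput is a hypothetical wrapper; GetConsoleMode succeeding serves as the "is this a console?" test, and converting to UTF-8 is my choice here, not a requirement):

#include <windows.h>

// Unified read: ReadConsoleW for real consoles, ReadFile for pipes/files.
// Error handling and partial-buffer logic are trimmed for brevity.
BOOL ReadInput(HANDLE in, char *buf, DWORD cap, DWORD *got)
{
    DWORD mode;
    if (GetConsoleMode(in, &mode)) {              // succeeds only for console handles
        wchar_t wbuf[512];
        DWORD wlen;
        if (!ReadConsoleW(in, wbuf, 512, &wlen, NULL))
            return FALSE;
        if (wlen == 0) { *got = 0; return TRUE; }
        int n = WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)wlen,
                                    buf, (int)cap, NULL, NULL);
        if (n <= 0)
            return FALSE;                         // e.g. cap too small
        *got = (DWORD)n;
        return TRUE;
    }
    return ReadFile(in, buf, cap, got, NULL);     // redirected input: raw bytes
}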
Maybe I could still keep ReadFile by providing a buffer large enough to hold any input in one go. However, I have no idea how to check or set the input buffer size. (I can only enter 256 characters (254 plus CRLF) in my application on my computer, but cmd.exe allows entering 8,192 characters, so this is a real problem.) It would also be helpful if more information about this could be provided.
P.S.: Maybe _getws could also help, but this question is about the Windows API, and my application needs to use some low-level console functions.
I'm capturing the output of an external program as described here. Now I am wondering about the text encoding to expect when reading the pipe's data into a memory buffer using ReadFile().
External programs can write to stdout in various ways, for example:
using printf()
using wprintf()
using WriteConsoleA()
using WriteConsoleW()
...
So will I get UTF-16 text if a program uses wprintf() or WriteConsoleW() to write to stdout and 8-bit text (depending on the default console encoding) if a program uses printf() or WriteConsoleA()? Or what encoding will text captured from an external program be in?
TL;DR: It depends on the program.
WriteConsoleA/W cannot write to pipes, only to the console, so they are not a factor here.
A program that uses WriteFile directly will write in whatever format its data happens to be in; most likely the active ANSI codepage, the OEM codepage, or UTF-16 LE.
A program that uses the wchar_t print functions and the Microsoft C run-time can choose the output format (_O_WTEXT (UTF-16? with BOM), _O_U8TEXT, or _O_U16TEXT) by calling _setmode or _wsopen.
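For example, a minimal sketch of a producer that opts into UTF-16 output (a reader on the other end of the pipe would then see UTF-16 LE bytes):

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    // After this call the wide CRT functions emit UTF-16 LE bytes on stdout
    // (narrow printf calls are not allowed in this mode).
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x00DC\n");   // "Ü"
    return 0;
}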
Most programs are not going to output UTF-16 LE unless you give them a switch to enable this feature (cmd.exe /U etc.). If you know nothing about the program but prefer Unicode, the best approach is to look for a BOM; if one is not present, try to parse the data as UTF-8, and if/when that fails, fall back to the ANSI or OEM codepage. If you have a fair amount of buffering you could also try to detect BOM-less UTF-16 with IsTextUnicode.
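A rough sketch of that detection order (GuessEncoding is a hypothetical helper; MB_ERR_INVALID_CHARS makes the conversion reject malformed UTF-8):

#include <windows.h>

typedef enum { ENC_UTF16LE, ENC_UTF8, ENC_ANSI } Encoding;

// Guess the encoding of the first 'len' captured bytes: BOM first,
// then strict UTF-8 validation, else assume the ANSI/OEM codepage.
Encoding GuessEncoding(const unsigned char *buf, int len)
{
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;                       // UTF-16 LE BOM
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;                          // UTF-8 BOM
    if (len > 0 && MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                       (const char *)buf, len, NULL, 0) > 0)
        return ENC_UTF8;                          // parses cleanly as UTF-8
    return ENC_ANSI;                              // fall back to CP_ACP/CP_OEMCP
}

Note that pure ASCII also passes the UTF-8 test, which is harmless since the codepages agree on ASCII.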
If you are attached to a console you can try to influence the other process by calling SetConsoleOutputCP, but I doubt anyone will listen.
See also:
Myth busting in the console
Conventional wisdom is retarded, aka What the ##%&* is _O_U16TEXT?
I have a very weird issue when my application is run under "Windows Vista compatibility mode" (right-click the EXE, enable compatibility mode, and select Windows Vista).
The issue is that the buffer length returned by the RegEnumValue function differs.
For example, with a registry value of "Zoom Player MAX" (15 characters):
With compatibility mode disabled, RegEnumValue's "lpcbData" field returns a value of 16 (including the trailing null terminator).
With compatibility mode enabled, RegEnumValue's "lpcbData" field returns a value of 15 (not including the trailing null terminator).
Is there a work-around/patch for this that doesn't require changing my string conversion code?
It should not matter. When reading from the Registry using the low-level classic functions, you must be able to handle strings both with and without null terminators:
Beware of non-null-terminated registry strings
The easy way to do this is to secretly allocate one extra character that you don't tell the API about when reading, and then append the '\0' character to the end of however many characters it returns.
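A sketch of that pattern with the classic API (ReadRegString is a hypothetical helper; error handling is trimmed):

#include <windows.h>
#include <stdlib.h>

// Read a REG_SZ value, tolerating data that lacks a trailing null.
wchar_t *ReadRegString(HKEY key, const wchar_t *name)
{
    DWORD cb = 0;
    if (RegQueryValueExW(key, name, NULL, NULL, NULL, &cb) != ERROR_SUCCESS)
        return NULL;
    // One extra wchar_t the API never hears about.
    wchar_t *buf = malloc(cb + sizeof(wchar_t));
    if (!buf || RegQueryValueExW(key, name, NULL, NULL,
                                 (BYTE *)buf, &cb) != ERROR_SUCCESS) {
        free(buf);
        return NULL;
    }
    buf[cb / sizeof(wchar_t)] = L'\0'; // terminate after whatever came back
    return buf;
}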
Newer functions like RegGetValue() handle this for you.
Say I have PROGRAM.ASM - I have the following in the data segment:
.data
Filename db 'file.txt', 0
Fhndl dw ?
Buffer db ?
I want 'file.txt' to be dynamic, I guess? Once compiled, PROGRAM.exe needs to be able to accept a file name via the command line:
c:\> PROGRAM anotherfile.txt
EXECUTION GOES HERE
How do I enable this? Thank you in advance.
DOS stores the command line in a legacy structure called the Program Segment Prefix ("PSP"). And I do mean legacy. This structure was designed to be backwards-compatible with programs ported from CP/M.
Where's the PSP?
You know how programs built as .COM files always start with ORG 100h? The reason for that is precisely this: for .COM programs, the PSP is always stored at the beginning of the code segment (at CS:0h). The PSP is 100h (256) bytes long, spanning offsets 0h through 0FFh, and the actual program code starts right after it (that is, at CS:100h).
The address is also conveniently available at DS:00h and ES:00h, since the key characteristic of the .COM format is that all the segment registers start with the same value (and a COM program typically never changes them).
To read the command line from a .COM program, you can pick up its length at CS:80h (or DS:80h, etc., as long as you haven't changed those registers). The command line starts at CS:81h and takes up the rest of the PSP, ending with a Carriage Return (0Dh) as a terminator, so the command line is never more than 126 bytes long.
(And that is why the command line has been 126 bytes in DOS forever, despite the fact that we all wished for years it could be made longer. Since WinNT provides a different mechanism to access the command line, the WinNT/XP/etc. command line doesn't suffer from this size limitation.)
For an .EXE program, you can't rely on CS:00h because the startup code segment can be just about anywhere in memory. However, when the program starts, DOS loads the PSP segment into both DS and ES. So, at startup, DS:00h and ES:00h will always point to the PSP, for both .EXE and .COM programs.
If you didn't keep track of the PSP address at the beginning of the program and have since changed both DS and ES, you can always ask DOS for the segment value at any time via INT 21h, function 62h. The segment portion of the PSP address is returned in BX (the offset being, of course, 0h).
In C++ we have a method to search for text in a file. It works by reading the file to a variable, and using strstr. But we got into trouble when the file got very large.
I thought I could solve this by calling find.exe using _popen. It works fine, except when these conditions are all true:
The file is of type unicode (BOM=FFFE)
The file is EXACTLY 4096 bytes
The text you are searching for is the last text in the file
To recreate, you can do this:
Open notepad
Insert 2046 X's then an A at the end
Save as test.txt, encoding = "unicode"
Verify that file is exactly 4096 bytes
Open a command prompt and type: find "A" /c test.txt -> No hits
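(If you'd rather script the repro than fight Notepad, a small C program such as this hypothetical generator produces the same 4096-byte file; the name test.txt matches the steps above.)

#include <stdio.h>

// Writes a 4096-byte UTF-16 LE file: BOM + 2046 'X's + one final 'A'.
int main(void)
{
    FILE *f = fopen("test.txt", "wb");
    if (!f) return 1;
    fputc(0xFF, f); fputc(0xFE, f);     // UTF-16 LE BOM (2 bytes)
    for (int i = 0; i < 2046; i++) {
        fputc('X', f); fputc(0x00, f);  // 'X' in UTF-16 LE
    }
    fputc('A', f); fputc(0x00, f);      // the last searchable character
    fclose(f);                          // total: 2 + 2047 * 2 = 4096 bytes
    return 0;
}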
I also tried this:
Add or remove an X, and you will get a hit (file is not 4096 bytes anymore)
Save as UTF-8 (and add enough X's so that the file is 4096 bytes again), and you get a hit
Search for something in the middle of the file (file still unicode and 4096 bytes), and you get a hit.
Is this a bug, or is there something I'm missing?
Very interesting bug.
This question caused me to do some experiments on XP and Win 7 - the behaviors are different.
XP
ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th. The matching line printout is truncated after 1023 characters.
Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th. The matching line printout is truncated after 1024 characters.
I find it very odd that the line limits for Unicode and ANSI on XP are not the same number of bytes, nor a simple multiple. The Unicode limit expressed as bytes is 2 times the ANSI limit plus 2 (2 * 1023 + 2 = 2048).
Note: truncation of matching long lines also truncates the new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.
Windows 7
ANSI - I have not found a limit to the maximum line length that can be searched (though I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line; it just can't display all of them.
Unicode - I have not found a limit to the maximum line length that can be searched (though I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line; it just can't display all of them.
Since Unicode byte lengths are always a multiple of 2, and the max ANSI displayable length is an odd number, it makes sense that the max displayable line length in bytes is one less for Unicode than for ANSI.
But then there is also the weird Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be searched or printed. It does not matter if the file contains a single line or multiple lines. It only depends on the total file length.
I find it interesting that the multiple-of-4096 bug threshold is within one of the max printable line length (in bytes). But I don't know if there is a relationship between those behaviors or if it is simply coincidence.
Note: truncation of matching long lines also truncates any new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.