Bug with find.exe? (Windows)

In C++ we have a method to search for text in a file. It works by reading the file into a variable and using strstr. But we got into trouble when the file got very large.
I thought I could solve this by calling find.exe using _popen. It works fine, except when these conditions are all true:
The file is of type unicode (BOM=FFFE)
The file is EXACTLY 4096 bytes
The text you are searching for is the last text in the file
To recreate, you can do this:
Open notepad
Insert 2046 X's then an A at the end
Save as test.txt, encoding = "unicode"
Verify that file is exactly 4096 bytes
Open a command prompt and type: find "A" /c test.txt -> No hits
I also tried this:
Add or remove an X, and you will get a hit (file is not 4096 bytes anymore)
Save as UTF-8 (and add enough X's so that the file is 4096 bytes again), and you get a hit
Search for something in the middle of the file (file still unicode and 4096 bytes), and you get a hit.
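For a scriptable version of the repro, here's a minimal Python sketch (it writes the same 4096-byte "unicode" file as the Notepad steps above):

# Notepad's "unicode" encoding is UTF-16LE with an FF FE BOM.
# 2 BOM bytes + 2047 characters * 2 bytes = exactly 4096 bytes.
import os

with open("test.txt", "w", encoding="utf-16-le") as f:
    f.write("\ufeff")             # the FF FE BOM (2 bytes on disk)
    f.write("X" * 2046 + "A")     # 2047 chars, "A" is the last character

assert os.path.getsize("test.txt") == 4096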
Is this a bug, or is there something I'm missing?

Very interesting bug.
This question caused me to do some experiments on XP and Win 7 - the behaviors are different.
XP
ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th. The matching line printout is truncated after 1023 characters.
Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th. The matching line printout is truncated after 1024 characters.
I find it very odd that the line limits for Unicode and ANSI on XP are not the same number of bytes, nor is one a simple multiple of the other. Expressed in bytes, the Unicode limit (2048) is two more than double the ANSI limit (1023).
Note: truncation of matching long lines also truncates the new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.
Windows 7
ANSI - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line; it just can't display all of them.
Unicode - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line; it just can't display all of them.
Since Unicode byte lengths are always a multiple of 2, and the max ANSI displayable length is an odd number, it makes sense that the max displayable line length in bytes is one less for Unicode than for ANSI.
But then there is also the weird Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be searched or printed. It does not matter if the file contains a single line or multiple lines. It only depends on the total file length.
I find it interesting that the multiple of 4096 bug is within one of the max printable line length (in bytes). But I don't know if there is a relationship between those behaviors or if it is simply coincidence.
Note: truncation of matching long lines also truncates any new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.
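If you want to repeat these experiments, a sketch like this can generate the test files (paths and counts are illustrative; "mbcs" is Python's name for the Windows ANSI codepage):

def make_test_file(path, n, unicode_=False):
    # One line: n X's, a trailing "A" marker, CRLF line ending.
    enc = "utf-16-le" if unicode_ else "mbcs"
    with open(path, "w", encoding=enc, newline="") as f:
        if unicode_:
            f.write("\ufeff")   # FF FE BOM so FIND treats the file as Unicode
        f.write("X" * n + "A\r\n")

make_test_file("ansi.txt", 1023)        # ANSI, marker at character 1024
make_test_file("uni.txt", 1024, True)   # Unicode, marker at character 1025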


Print lines around position in the file

I'm importing a big CSV file (5 GB) into BigQuery, and I got a report of an error in the file along with its position, specified as a byte offset from the start of the file (for example, 134683757). I'd like to look at the lines around this error position.
Some example lines of the file:
field1, field2, field3
abc, bcd, efg
...
dge, hfr, kdf,
dgj, "a""a", fbd # in this line is an invalid csv element and I get error, let's say on the position 134683757
skd, frd, lqw
...
asd, fij, fle
I need some command to show the lines around the error, like:
dge, hfr, kdf,
dgj, "a""a", fbd
skd, frd, lqw
I tried sed and awk but I didn't find any simple solution.
It was definitely not clear from the original version of the question that you only got a byte offset from the start of the file.
You need to get a better position from the software generating the error; the developer was lazy in reporting an unusable number. It is reasonable to request a line number (and preferably offset within the line), rather than (or as well as) the byte offset from the start.
Assuming that the number is a byte position in the file, that gets tricky. Most Unix utilities work with lines (of variable length). I'd be tempted to write some C code to do the job, but that might be beyond you (and no shame in that).
Failing that, your best bet is likely the dd command. If the number reported is 134683757, then I'd guess that your lines are probably not more than 1 KiB each (adjust the numbers if they're bigger, or smaller), and then use:
dd if=big.csv of=extract.csv bs=1 skip=$((134683757 - 3 * 1024)) count=6144
echo >> extract.csv
You'd then look at extract.csv. The raw dd output probably won't have a newline at the end of the last line (the echo >>extract.csv fixes that). The output will probably start part way through a record and end part way through another record. However, you're likely to have the relevant information, as well as some irrelevant information. As I said, adjust the numbers to suit your exact situation.
The trickiest part is identifying exactly where the byte offset is in the file you get. With custom C code, that can be provided easily (more easily). With the output from dd, you have to do the calculation yourself.
awk -v offset=$((134683757 - 3 * 1024)) '
{ printf "%9d: %s\n", offset, $0; offset += length($0) + 1 }
' extract.csv
That takes the starting offset from the dd command and prefixes the (remnants of the) first line with that number and the data; it then adds the line's length to the offset, plus one for the newline that wasn't counted, and continues to the end of the file. That gives you the start offset for each line in the extracted data. You can see where your actual start was by looking at the offsets; you should be able to identify which record that was.
You could use a variant of this Awk script that reads the whole file line by line, and tracks the offset (as well as the line numbers) and prints the data when it gets to the vicinity of where you have the problem.
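As a sketch of that variant (in Python rather than Awk; the file name and window size are illustrative):

# Scan the whole file once, tracking the byte offset and line number,
# and print the lines that fall within 3 KiB of the reported offset.
target, window = 134683757, 3 * 1024

offset = 0
with open("big.csv", "rb") as f:        # binary mode, so offsets are exact
    for lineno, line in enumerate(f, 1):
        if target - window <= offset <= target + window:
            print(lineno, offset, line.decode("utf-8", "replace"), end="")
        offset += len(line)             # len includes the newline byte(s)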
In times long past, I had to deal with data from 1/2 inch mag tapes (those big circular tapes you see in old movies) where the files generated on a mainframe seemed sanely formatted for the first few tens of megabytes, but then the format changed to some alternative format for a few megabytes, and then reverted to the original format once more. I never did find out why; I just learned how to deal with it. Trial and error!

File seek with two-byte characters

I'm writing a small log parser, which should find some tags in files.
The files are large (512 MB) and have the following structure:
[2018.07.10 00:30:03:125] VersionInfo\886
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingTime\16
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingData\397
...some data...
[2018.07.10 00:30:03:749][TraceID: 8HRWSI105YVO91]->OutgoingData\26651
...somedata...
Each block (IncomingTime, IncomingData, OutgoingData, etc.) has its size at the end: 886, 16, 397, 26651. These sizes are character counts, not bytes. Some blocks are very large and can't be read without a large buffer (if I use bufio). I want to skip unnecessary blocks using file.Seek.
The problem is that file.Seek needs a byte length, and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any chance to skip blocks using the character count?
That's actually impossible. As you've described the file format, both of the following are possible:
...VersionInfo\1
[ 20 ]
...VersionInfo\1
[ C2 A0 ]
If you've just read the newline and you know you need to read one character, you know it's somewhere between 1 and 2 bytes (UTF-8 characters can go up to 4 bytes even) but not which, and blindly launching forward some number of bytes without inspecting the intermediate data won't work. The pathological case is a larger block, where the first half has many multi-byte characters and the last half has text that happens to look like one of your entry headers.
With this file format you're forced to read it a character at a time.
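The same point, sketched in Python (the question is Go, but the principle carries over): skipping N characters means decoding N characters, because only a decoding read can count them.

def skip_block(f, nchars):
    # f must be opened in text mode, e.g. open(path, encoding="utf-8"),
    # so that read() counts decoded characters rather than bytes.
    remaining = nchars
    while remaining > 0:
        chunk = f.read(min(remaining, 8192))
        if not chunk:
            raise EOFError("file ended inside a block")
        remaining -= len(chunk)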

Unexpected blank lines in Python output to Windows console

I have a little program that prints out a directory structure.
It works fine except when the directory names contain German umlaut characters.
In that case it prints a blank line after the directory line.
I'm running Python 3.5.0 on Windows 7 64-bit.
This Code ...
class dm():
    ...
    def print(self, rootdir=None, depth=0):
        if rootdir is None:
            rootdir = self.initialdir
        if rootdir in self.dirtree:
            print('{}{} ({} files)'.format(' '*depth,
                                           rootdir,
                                           len(self.dirtree[rootdir]['files'])))
            for _dir in self.dirtree[rootdir]['dirs']:
                self.print(os.path.join(rootdir, _dir), depth+1)
        else:
            pass
...produces the following output:
B:\scratch (11 files)
 B:\scratch\Test1 (3 files)
 B:\scratch\Test1 - Kopie (0 files)
 B:\scratch\Test1 - Übel (0 files)

 B:\scratch\Test2 (3 files)
  B:\scratch\Test2\Test21 (0 files)
This happens with the codepage set to 65001. If I change the codepage to e.g. 850, the blank line disappears, but of course the "Ü" isn't printed correctly.
The structure self.dirtree is a dict of dicts of lists; it is built with os.walk and seems OK.
Python or Windows? Any suggestions?
Marvin
There are several bugs when using codepage 65001 (UTF-8) -- all of which are due to the Windows console (i.e. conhost.exe), not Python. The best solution is to avoid this buggy codepage, and instead use the wide-character API, such as by loading win_unicode_console.
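For example (win_unicode_console is a third-party package: pip install win_unicode_console):

import win_unicode_console

# Replace Python's console streams with ones that use the wide-character
# console API (ReadConsoleW/WriteConsoleW), bypassing codepage 65001.
win_unicode_console.enable()

print("B:\\scratch\\Test1 - Übel (0 files)")   # no spurious blank line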
You're experiencing a bug that exists in the legacy console that was used prior to Windows 10. (It's still available in Windows 10 if you select the option "Use legacy console".) The console decodes the UTF-8 buffer to UTF-16 and reports back that it writes b'\xc3\x9c' (i.e. "Ü" encoded as UTF-8) as one character, but it's supposed to report back the number of bytes that it writes, which is two. Python's buffered sys.stdout sees that apparently one byte wasn't written, so it dutifully writes the last byte of the line again, which is b'\n'. That's why you get an extra newline. The result can be far worse if a written buffer has many non-ASCII characters, especially codes above U+07FF that get encoded as three UTF-8 bytes.
There's a worse bug if you try to paste "Ü" into the interactive REPL. This bug is still present even in Windows 10. In this case a process is reading the console's wide-character (UTF-16) input buffer encoded as UTF-8. The console does the conversion via WideCharToMultiByte with a buffer that assumes one Unicode character is a single byte in the target codepage. But that's completely wrong for UTF-8, in which one UTF-16 code may map to as many as three bytes. In this case it's two bytes, and the console only allocates one byte in the translation buffer. So WideCharToMultiByte fails, but does the console try to increase the translation buffer size? No. Does it fail the call? No. It actually returns back that it 'successfully' read 0 bytes. To Python's REPL that signals EOF (end of file), so the interpreter just exits as if you had entered Ctrl+Z at the prompt.

Crack some exe file - how to remove bytes

Today I am trying to remove some bytes from an EXE file.
Inside the EXE I found a path to a file that the EXE needs to load. I want to change the path, and to do that I have to remove some ../../ characters. When I do that and save the file, it loses its icon, and a 'Win32 unknown format' error is displayed when I try to execute it.
If I don't remove those bytes but replace them with 0, the icon is not lost and the file looks right. Yet the path is incorrect.
So it looks like when I remove bytes, the position of other information inside the file is lost, including resources (the icon). After removing those bytes, I need to add another 6 bytes to keep the size and position of the other data the same. Where should I do that? If I add those bytes at the end of the file, it doesn't work. Could you give me some clues? Thanks!
After removing the ../../ from the start of the string, stick six 0 bytes at the end of the string (I'm assuming you can identify the end manually). That way the offset of everything in the file remains the same. By removing the 6 bytes entirely, the offset of things after the string would change. By replacing the 6 bytes with 0s, the offset of the string would change (it would now really be at wherever it was + 6).
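A sketch of that edit as a script (the file and path strings are illustrative; the real ones come from inspecting the EXE in a hex editor). The key point is to overwrite in place so that every offset in the file stays the same:

data = bytearray(open("app.exe", "rb").read())

old = b"../../data/config.bin\x00"   # the string as found in the EXE
new = b"data/config.bin\x00"         # same string minus the ../../

i = data.index(old)                                  # locate the path
data[i:i + len(old)] = new.ljust(len(old), b"\x00")  # pad the end with 0 bytes

open("app_patched.exe", "wb").write(bytes(data))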

Maximum Length of Command Line String

In Windows, what is the maximum length of a command line string? Meaning if I specify a program which takes arguments on the command line such as abc.exe -name=abc
A simple console application I wrote takes parameters via command line and I want to know what is the maximum allowable amount.
From the Microsoft documentation: Command prompt (Cmd.exe) command-line string limitation
On computers running Microsoft Windows XP or later, the maximum length of the string that you can use at the command prompt is 8191 characters.
Sorry for digging out an old thread, but I think sunetos' answer isn't correct (or isn't the full answer). I've done some experiments (using ProcessStartInfo in C#) and it seems that the 'arguments' string for a command-line command is limited to 2048 characters in XP and 32768 characters in Win7. I'm not sure what the 8191 limit refers to, but I haven't found any evidence of it yet.
Like @Sugrue, I'm also digging out an old thread.
To explain why there is a 32768-character limitation (I think it should be 32767, but let's believe the experimental testing result), we need to dig into the Windows API.
No matter how you launch a program with command-line arguments, the call goes to ShellExecute, CreateProcess, or an extended version of either. These APIs basically wrap other NT-level APIs that are not officially documented. As far as I know these calls wrap NtCreateProcess, which requires an OBJECT_ATTRIBUTES structure as a parameter; to create that structure, InitializeObjectAttributes is used. There we see UNICODE_STRING. So now let's take a look at this structure:
typedef struct _UNICODE_STRING {
    USHORT Length;
    USHORT MaximumLength;
    PWSTR  Buffer;
} UNICODE_STRING;
It uses a USHORT (a 16-bit value, range [0; 65535]) to store the length, and according to the documentation this length is in bytes, not characters. So we have 65535 / 2 = 32767 (because a WCHAR is 2 bytes long).
There are a few steps to dig into this number, but I hope it is clear.
Also, to support @sunetos' answer, which is accepted: 8191 is the maximum number of characters allowed to be entered into cmd.exe; if you exceed this limit, the error 'The input line is too long.' is generated. So the answer is correct, despite the fact that cmd.exe is not the only way to pass arguments to a new process.
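A rough way to probe these limits yourself (a Python sketch; whether you hit cmd.exe's 8191 limit or the roughly 32767-character CreateProcess limit depends on how the process is launched):

import subprocess

def accepts(n):
    # Launch cmd with an n-character tail; a failure means we hit either
    # cmd's input limit or the CreateProcess command-line limit.
    try:
        subprocess.run("cmd /c rem " + "x" * n, check=True)
        return True
    except (OSError, ValueError, subprocess.CalledProcessError):
        return False

lo, hi = 1, 1 << 16
while lo + 1 < hi:                     # binary search for the cutoff
    mid = (lo + hi) // 2
    lo, hi = (mid, hi) if accepts(mid) else (lo, mid)
print("longest accepted command tail:", lo)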
In Windows 10, it's still 8191 characters... at least on my machine.
It just cuts off any text after 8191 characters. Well, actually, I got 8196 characters, and after 8196 it just won't let me type any more.
Here's a script that will test how long a statement you can use, assuming you have gawk/awk installed.
echo rem this is a test of how long of a line that a .cmd script can generate >testbat.bat
gawk 'BEGIN {printf "echo -----";for (i=10;i^<=100000;i +=10) printf "%%06d----",i;print;print "pause";}' >>testbat.bat
testbat.bat
