Issues with getline/file reading in Windows - visual-studio

I created some .txt files on my Mac (didn't think that would matter at first, but...) so that I could read them in the application I am making in (unfortunately) Visual Studio on a different computer. They are basically files filled with records, with the number of entries per row at the top, e.g.:
2
int int
age name
9 Bob
34 Mary
12 Jim
...
In the code, which I originally just made (and tested successfully) on the Mac, I attempt to read this file and similar ones:
Table TableFromFile(string _filename){ //For a database system
    ifstream infile;
    infile.open(_filename.c_str());
    if(!infile){
        cerr << "File " << _filename << " could not be opened.";
        exit(1);
    }
    //Determine the number of attributes (columns) in the table,
    //which is the number on the first line of the input file
    std::string num;
    getline(infile, num);
    int numEntries = atoi(num.c_str());
    ...
    ...
In short, this causes a crash! As I looked into it, I found some interesting "Error reading characters of string" issues and found that numEntries is getting some crazy negative garbage value. This seems to be caused by the fact that "num", which should just be "2" as read from the first line, is actually coming out as "ÿþ2".
From a little research, it seems that these strange characters are formatting things...perhaps Unicode- or Mac-specific? In any case, they are a problem, and I am wondering if there is a fast and easy way to make the text files I created on my Mac cooperate and behave in Windows just like they did in the Mac terminal. I tried connecting to a UNIX machine, putting a txt file there, running unix2dos on it, and putting it back into VS, but to no avail...still those symbols at the start of the line! Should I just make my input files all over again in Windows? I am very surprised to learn that what you see is not always what you get when it comes to characters in a file across platforms...but a good lesson, I suppose.

As the commenter indicated, the bytes you're seeing are the byte order mark. See http://en.wikipedia.org/wiki/Byte_order_mark.
"ÿþ" is 0xFFFE, the UTF-16 "little endian" byte order mark. The "2" is your first actual character (for UTF-16, characters below 256 will be represented by bytes of the for 0xnn00;, where "nn" is the usual ASCII or UTF-8 code for that character, so something trying to read the bytes as ASCII or UTF-8 will do OK until it reaches the first null byte).
If you need to puzzle out the Unicode details of a text file, the best tool I know of is the free SC Unipad editor (www.unipad.org). It is Windows-only, but it can read and write pretty much any encoding and will tell you what there is to know about the file. It is very good at guessing the encoding.
Unipad will be able to open the file and let you save it in whatever encoding you want: ASCII, UTF-8, etc.
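If re-saving the files isn't convenient, a minimal sketch of sniffing the byte order mark before parsing might look like the following (DetectBom is just an illustrative helper, not part of any library; it only identifies the encoding, and a UTF-16 file would still need converting to UTF-8 or ASCII before the getline-based parsing above will work):

#include <fstream>
#include <string>

// Illustrative helper: peek at the first bytes of a file and report which
// byte order mark, if any, is present.
std::string DetectBom(const std::string& filename) {
    std::ifstream in(filename.c_str(), std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::streamsize n = in.gcount();
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 BOM";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE BOM";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE BOM";
    return "no BOM";
}

On the command line, something like iconv -f UTF-16LE -t UTF-8 infile.txt > outfile.txt should also convert the files in one step, assuming iconv is available.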

Related

Rust println! prints weird characters under certain circumstances

I'm trying to write a short program (short enough that it has a simple main function). First, I should list the dependency in the Cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() < 2
    {
        println!("Error! Needed second argument to demonstrate BCrypt Hash!");
        return;
    }
    let password = args.get(1).expect("Expected second argument to exist!").trim();
    let hash_res = hasher::bcrypt(10, "This_is_salt", password);
    match hash_res
    {
        Err(_) => {println!("Failed to generate a hash!");},
        Ok(hash) => {
            let str_hash = String::from_utf8_lossy(&hash);
            println!("Hash generated from password {} is {}", password, str_hash);
        }
    }
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as Rust println! problem - weird behavior inside the println macro and Why does my string not match when reading user input from stdin? but those issues seem to be issues with new-line and I don't think that's the problem here.
To complement the previous answer, the answer to your question of "is there something within UTF-8 that could cause the 'Hash generat' portion of the output statement to be overwritten?" is this line:
let str_hash = String::from_utf8_lossy(&hash);
The reason is in the name: from_utf8_lossy is lossy. UTF-8 is a pretty prescriptive format. You can use this function to "decode" data which isn't actually UTF-8 (for whatever reason), but the way it does that decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
And that is what the odd characters you're seeing are: byte sequences which cannot be decoded as UTF-8 get replaced by the "replacement character".
This happens because hash functions generally return random-looking binary data, meaning bytes across the full range (0 to 255) with no structure. UTF-8 is structured and absolutely does not allow such arbitrary data, so while it's possible for a hash to happen to be valid UTF-8 (not that it would be very useful), the odds are very, very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations, e.g. hex, base32, or base64.
You could convert the hash to hex before printing it to prevent this.
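As a sketch (assuming hash is the byte array returned by hasher::bcrypt in the code above), that could look like:

// Render each byte as two hex digits instead of decoding the bytes as UTF-8.
let hex_hash: String = hash.iter().map(|b| format!("{:02x}", b)).collect();
println!("Hash generated from password {} is {}", password, hex_hash);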
Neither of the other answers so far has covered what caused the Hash generated part of the output to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that give the terminal information such as which formatting they should use to output the text they're showing, and where the text should be output on the screen. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing the characters in question – the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you try to print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation – it will output control codes in escaped form rather than outputting them literally. (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)
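Reusing str_hash from the question's code, that is just a matter of switching the format specifier (a sketch, not the only way to do it):

// {:?} uses the Debug implementation, which prints the string quoted and with
// control characters (such as \r) escaped rather than sent to the terminal.
println!("Hash generated from password {} is {:?}", password, str_hash);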

File with first bit of every byte set to 0

I was given a file that seems to be encoded in UTF-8, but every byte that should start with 1 starts with 0.
E.g. where one would expect the Polish letter 'ę', encoded in UTF-8 as \o304\o231, there is \o104\o031. Or, in binary, there is 01000100:00011001 instead of 11000100:10011001.
I assume that this was not done on purpose by an evil file creator who enjoys my headache, but rather is the result of some erroneous operation performed on a correct UTF-8 file.
The question is: what "reasonable" operations could be the cause? I have no idea how the file was created; it was probably exported by some unknown software, and then it could have been compressed, uploaded, copied & pasted, converted to another encoding, etc.
I'll be grateful for any idea :)
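For what it's worth, a quick check (a hypothetical snippet, not part of the original question) shows that clearing the top bit of every byte reproduces exactly the bytes described above:

#include <cstdio>

int main() {
    // The UTF-8 encoding of 'ę' is 0xC4 0x99 (octal 304 231).
    unsigned char utf8[] = {0xC4, 0x99};
    for (unsigned char c : utf8)
        std::printf("%03o -> %03o\n", c, c & 0x7F);  // prints "304 -> 104" and "231 -> 031"
    return 0;
}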

GDB comparing an existent character with a non-existent character?

I am currently trying to hone my skills in reading assembly in GDB and I ran into some weirdness when trying to read a character in GDB, and I am not sure what is going on.
For some context, the file I am looking at is compiled; there is no .c source file, though the code was compiled from C. It is essentially a 'bomb' assignment file, where specific inputs are required to get to the next section of the code, and this code comes from testing one of the inputs against whatever the answer is supposed to be.
The code that contains the character I am trying to read is as follows:
cmp -0x1(%rbp),%al
je 0x400acf <nextpartofcode>
I am trying to read the -0x1(%rbp), so I inputted print/c $rbp-1 to try to look at it, and GDB printed: 175 '\257'. Assuming that this output meant the comparison would succeed if I put in the ASCII character 175, I tried it; however, when I inputted the character (it looks like >>), it was shown as -62 '\302'.
I also tried reading the value as an integer, octal value, decimal value, string, and hex value with the same amount of success, and I am at a loss at what else I can try. What exactly is happening here? Am I looking in the wrong place (i.e. is -0x1(%rbp) not $rbp-1)? Am I reading the value as something it isn't (I was told it should be a char, but is it something else)? Should I be looking somewhere else for the value? I'm stuck, and I would appreciate any guidance.
$rbp-1 is the address; you need to dereference it. (Use GDB's x command to eXamine memory at an address, or use C syntax to dereference a pointer for print, like p /c *(char*)($rbp-1).)
You were printing the low byte of the address as a char.
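For illustration, the two approaches at the GDB prompt would look like this (the value printed will of course depend on the program being debugged):

(gdb) x/1bd $rbp-1
(gdb) p /c *(char*)($rbp-1)

The first examines one byte at that address and formats it as a signed decimal; the second dereferences the address with C syntax and prints it as a char.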

A hint for end of ASCII data in a binary file

I'm developing a piece of software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat the file to inspect its contents, I'm thinking of having an ASCII-compatible "magic string" at the start of the file that gives the name and the version of the binary format.
I'm thinking of having at least ten lines (terminated by \n) in the message so that head with its default settings doesn't reach the binary part.
Now, I wonder if there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text, and should be just ignored? I tried 0x00 (the null byte) and 0x04 (ctrl-D) but they seem to be just ignored when catting the file.
cat regards a file as text. There is no way you can trigger an end-of-file from the content, since EOF is not actually a character.
The other way around works, of course: specify a format in which readers only start treating the data as binary from a certain marker onward.
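A minimal sketch of that idea (the format name, the wording of the header, and WriteDataFile are all placeholders, not an established convention):

#include <fstream>
#include <string>

// Write a short human-readable header, then the raw binary payload.
// Readers of the format skip everything up to and including the blank line.
void WriteDataFile(const std::string& path, const char* payload, std::streamsize size) {
    std::ofstream out(path.c_str(), std::ios::binary);
    out << "MYFORMAT version 1.0\n"
        << "This is a binary data file; the readable part ends here.\n"
        << "\n";
    out.write(payload, size);
}

Padding the header out to ten or more \n-terminated lines, as suggested above, keeps head's default output within the readable part.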

Why does gtk+ say "invalid utf-8" when debugging on eclipse?

I have been creating a GTK+ application in Eclipse. At a point in the code, an alert dialogue is displayed using code similar to the GTK+ hello world. When I run this program, the dialogue ends up displaying the content of 'words' as expected, but the program crashes when I close the dialogue. I am new to C, so I ran the program with debug expecting to find some simple mistake. However, when I ran with debug, the dialogue displayed 'words' preceded by many null characters and logged the message.
Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()
This new problem is confusing, and to add to the confusion, the program also did not crash when the dialogue was closed.
In summary, when I run the code, the text is fine, and the program crashes. When I debug the code, the text is invalid, and the program does not crash.
The text in the dialogue is generated with the following code:
char* answerBuffer = (char*)malloc(strlen(s)+strlen(words)+1);
strcat(answerBuffer,words);
char* answer = (char*)malloc(strlen(answerBuffer)+1);
g_strlcpy(answer,answerBuffer,strlen(answerBuffer)+1);
return answer;
As the code executes, the length of answerBuffer is 320 and words is a char* argument set to "a,b,c,d". I am running this on Windows XP through Eclipse with the MinGW compiler, using GTK+ 2.24. Can anyone tell me how to debug/fix this?
P.S. 's' contains text from a file followed by either one or twelve null characters (one if I run, twelve if I debug).
Given the code you've supplied, this line is the problem:
strcat(answerBuffer,words);
Why? Because you don't know what is in answerBuffer. malloc() doesn't necessarily zero the memory it returns to you, so answerBuffer contains essentially random bytes. You need to zero at least the first byte (so the buffer looks like a zero-length string), or use calloc() to allocate your buffer, which gives you zeroed memory.
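A sketch of that fix, assuming the surrounding code is as shown in the question:

/* calloc returns zeroed memory, so answerBuffer starts out as a valid,
   empty string and strcat has a terminator to append after. */
char* answerBuffer = (char*)calloc(strlen(s) + strlen(words) + 1, 1);
if (answerBuffer == NULL)
    return NULL;
strcat(answerBuffer, words);

Since the buffer is sized for strlen(s) + strlen(words) + 1, presumably the intent was also to copy s into it (e.g. with strcpy(answerBuffer, s)) before appending words; as posted, that extra space goes unused.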
Well, the odds are that the content of 's' isn't a valid UTF-8 sequence.
Look up what UTF-8 is about in case that confuses you. Or make sure your text file contains only ASCII characters, for simplicity.
If that doesn't help you, then you're probably messing up somewhere with the file read or possible encoding conversions.
