What character encoding is this? - utf-8

I'm interfacing with an Oracle DB which has a messed-up encoding (ASCII7 according to the DB properties, but it actually stores Korean characters).
When I get some of the Korean strings from the ResultSet and look at the bytes, it turns out that they correspond exactly to this file (which I found by googling some of the byte sequences): http://211.115.85.9/files/raw3.txt
Kinda spooky, as it seems to be the ONLY thing on the internet that has anything about this particular encoding...
The file, when viewed with EditPlus 3, shows me 3 columns.
The first column is an alphabetical listing of Korean characters. The second is the strange encoding I'm seeing in the Java strings returned from the Oracle DB. The third is UTF-8.
I'm trying to figure out what the middle column is encoded in. Can anyone point me in the right direction?
(I really don't want to have to actually read from this file every time I need to call a DB...)

It is EUC-KR (or similar) encoded data that was interpreted as a single-byte encoding (ISO-8859-1 or similar) and then encoded as UTF-8.
In other words: it's ill-encoded data, but it might be salvageable:
// The raw bytes of the first middle-column entry: c2 b0 c2 a1
byte[] bytes = new byte[] { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa1 };
// Undo the UTF-8 step: c2b0 c2a1 decodes to the two characters U+00B0 U+00A1
String str = new String(bytes, "UTF-8");
// Re-encode them as the single-byte encoding they were mistaken for: bytes b0 a1
bytes = str.getBytes("ISO-8859-1");
// Finally decode those bytes as what they really are: EUC-KR
str = new String(bytes, "EUC-KR");
System.out.println(str);
This prints 가 on my system.
I've found this PDF file which explains the problem (and how it happened) in more detail.

The third column is UTF-8 encoding:
가 c2b0c2a1 eab080
각 c2b0c2a2 eab081
간 c2b0c2a3 eab084
갇 c2b0c2a4 eab087
...
I don't know the meaning of the middle column, but the third column is a hex representation of the Hangul in the first column.
Look at the file with a hex editor; that may help.
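As a quick check, decoding the third-column bytes of the first row as UTF-8 does give back the Hangul from the first column, e.g. in Java:
// Third-column value for the first row is "eab080"; decoding it as UTF-8
// gives back the character in the first column.
byte[] utf8 = { (byte) 0xea, (byte) 0xb0, (byte) 0x80 };
System.out.println(new String(utf8, java.nio.charset.StandardCharsets.UTF_8)); // prints 가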
Good luck! :)

I wrote a little script and brute-force decoded the middle column of the first two lines (a sketch of the approach is shown below the results).
The following four results are Hangul, but I'm not sure whether they make sense:
utf_16_be => 슰슡 슰슢
johab => 춿춰 춿춱
euc_kr => 째징 째짖
cp949 => 째징 째짖
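A sketch of that kind of brute-force probe, written in Java here since the original question is dealing with Java strings; the charset names are Java's, and x-Johab may not be available on every JVM, so it is guarded with a check:
import java.nio.charset.Charset;

public class ProbeMiddleColumn {
    public static void main(String[] args) {
        // Middle-column bytes for the first two lines of the file (c2b0c2a1, c2b0c2a2).
        byte[] first  = { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa1 };
        byte[] second = { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa2 };

        // Candidate charsets (Java names); x-Johab is not shipped with every JVM.
        String[] candidates = { "UTF-16BE", "x-Johab", "EUC-KR", "x-windows-949" };
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported on this JVM");
                continue;
            }
            Charset cs = Charset.forName(name);
            System.out.println(name + " => " + new String(first, cs) + " " + new String(second, cs));
        }
    }
}
With the right charsets available this should reproduce the four results above.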
I hope that helped. Have a nice day! :)

Related

Rust println! prints weird characters under certain circumstances

I'm trying to write a short program (short enough that it has a simple main function). First, here is the dependency in the Cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;
fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() < 2
    {
        println!("Error! Needed second argument to demonstrate BCrypt Hash!");
        return;
    }
    let password = args.get(1).expect("Expected second argument to exist!").trim();
    let hash_res = hasher::bcrypt(10, "This_is_salt", password);
    match hash_res
    {
        Err(_) => { println!("Failed to generate a hash!"); },
        Ok(hash) => {
            let str_hash = String::from_utf8_lossy(&hash);
            println!("Hash generated from password {} is {}", password, str_hash);
        }
    }
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as "Rust println! problem - weird behavior inside the println macro" and "Why does my string not match when reading user input from stdin?", but those seem to be newline issues and I don't think that's the problem here.
To complement the previous answer, the answer to your question "is there something within UTF-8 that could cause the 'Hash generat' portion of the output statement to be overwritten?" is this line:
let str_hash = String::from_utf8_lossy(&hash);
The reason is in the name: from_utf8_lossy is lossy. UTF-8 is a pretty prescriptive format. You can use this function to "decode" stuff which isn't actually UTF-8 (for whatever reason), but the way it does this decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
So that is what the odd characters you are seeing are: byte sequences which cannot be decoded as UTF-8 and are replaced by the replacement character.
This happens because hash functions generally return random-looking binary data: bytes across the full range (0 to 255) with no structure. UTF-8 is structured and does not allow such arbitrary data, so while it's possible for a hash to happen to be valid UTF-8 (not that that would be useful), the odds are very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations, e.g. hex, base32, or base64.
You could convert the hash to hex before printing it to prevent this.
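For illustration, the fix is simply to render each byte as two hex digits instead of trying to decode the bytes as text; in Rust you could format each byte with {:02x} in a loop. A rough sketch of such a helper, shown in Java only to match the snippets at the top of this page:
// Render arbitrary binary data (e.g. a hash) as a hex string instead of text.
static String toHex(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length * 2);
    for (byte b : data) {
        sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
}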
Neither of the other answers so far has covered what caused the Hash generated part of the output to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that give the terminal information such as which formatting they should use to output the text they're showing, and where the text should be output on the screen. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing the characters in question – the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you try to print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation – it will output control codes in escaped form rather than outputting them literally. (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)
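To see the carriage-return effect in isolation, it is enough to print a string containing an embedded \r: the cursor jumps back to column 0 and whatever follows overwrites what was already on the line. A one-line demo (again in Java purely for consistency with the earlier snippets; the terminal behaves the same way regardless of the language doing the printing):
// In a terminal this shows up as "XYZh generated from password trooper1 is "
// because \r moves the cursor to column 0 and "XYZ" overwrites "Has".
System.out.println("Hash generated from password trooper1 is \rXYZ");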

Shorten Chinese string to fit in character array in C++

I am trying to fit a pinyin string into a character array. For example, if I have a string like the one below:
string str = "转换汉字为拼音音"; // needs at least 25 bytes to store
char destination[22];
strncpy(destination, str.c_str(), 20);
destination[21] = '\0';
Since Chinese characters take 3 bytes, I can do strncpy(destination, str.c_str(), (20/3)*3); but if str contains any character other than Chinese (one that takes 2 or 4 bytes in the UTF-8 encoding) this logic will fail.
Later, if I try to print destination, only the first 6 Chinese characters are printed properly and 2 stray bytes are printed as hex.
Is there any way I can shorten the string before copying it to destination so that, when destination is printed, proper Chinese characters are printed (without any individual hex bytes)? Perhaps using the Poco::TextEncoding or Poco::UTF8Encoding class?
Thanks in Advance.
Nothing short of creating your own way to encode text would work. But even in that case you would have to create a 25-character array (don't forget the zero at the end!) to store the string so it prints properly, unless you also create your own printing routines.
I.e. the amount of work required doesn't balance out the win of 3 bytes.
Note that this code is practically C; in C++ you wouldn't normally write it this way.
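That said, if you only want to cut a UTF-8 byte sequence without splitting a character, the usual trick is to back up from the desired cut point over continuation bytes (bytes matching the pattern 10xxxxxx) until you reach the start of a character. The byte test, (b & 0xC0) == 0x80, is identical in C++; here is a rough sketch of the idea in Java (the language used for the other examples on this page), assuming the input is valid UTF-8:
// Largest cut point <= limit that does not split a UTF-8 character.
static int utf8CutPoint(byte[] utf8, int limit) {
    if (limit >= utf8.length) {
        return utf8.length;                  // nothing needs to be cut
    }
    int cut = limit;
    // Continuation bytes have the bit pattern 10xxxxxx.
    while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
        cut--;
    }
    return cut;                              // bytes[0..cut) ends on a character boundary
}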

wxWidgets and UTF8 - some characters missing

So I have this file encoded in UTF-8. I load it and print it like this:
char buffer[2048] = {0};
FILE *pFile = fopen("D:/localization.csv","rb");
int iret = fread(buffer,1,2048,pFile);
fclose(pFile);
wxString strMessageText = wxString::FromUTF8(buffer);
wxMessageBox(strMessageText);
The problem is that when the text contains some "invalid" characters, it doesn't get created (length of strMessageText is 0). I noticed, for instance, that Danish or German characters are fine but when I put Polish or Russian chars in the text file the wxString::FromUTF8 function fails to create proper text. Any idea?
If the file contains correctly encoded UTF-8 text, wxString::FromUTF8() will decode it. If it doesn't, you can still use wxMBConvUTF8 with e.g. MAP_INVALID_UTF8_TO_OCTAL to preserve even incorrectly encoded bytes in the input, but this isn't a good idea, in general.
I found the solution here: https://forums.wxwidgets.org/viewtopic.php?f=1&t=41068
It turned out that my wxWidgets library was out of date. I had version 2.8.12, updated to 3.0.2, and now it's fine.

Issues with getline/file reading in Windows

I created some .txt files on my Mac (didn't think that would matter at first, but...) so that I could read them in the application I am making in (unfortunately) Visual Studio on a different computer. They are basically files filled with records, with the number of entries per row at the top, e.g.:
2
int int
age name
9 Bob
34 Mary
12 Jim
...
In the code, which I originally just made (and tested successfully) on the Mac, I attempt to read this file and similar ones:
Table TableFromFile(string _filename){ //For a database system
    ifstream infile;
    infile.open(_filename.c_str());
    if(!infile){
        cerr << "File " << _filename << " could not be opened.";
        exit(1);
    }

    //Determine number attributes (columns) in table,
    //which is number on first line of input file
    std::string num;
    getline(infile, num);
    int numEntries = atoi(num.c_str());
    ...
    ...
In short, this causes a crash! As I looked into it, I found some interesting "Error reading characters of string" issues and found that numEntries is getting some crazy negative garbage value. This seems to be caused by the fact that "num", which should just be "2" as read from the first line, is actually coming out as "ÿþ2".
From a little research, it seems that these strange characters are formatting things... perhaps Unicode/Mac specific? In any case, they are a problem, and I am wondering if there is a fast and easy way to make the text files I created on my Mac cooperate and behave in Windows just like they did in the Mac terminal. I tried connecting to a UNIX machine, putting a txt file there, running unix2dos on it, and putting it back into VS, but to no avail... still those symbols at the start of the line! Should I just make my input files all over again in Windows? I am very surprised to learn that what you see is not always what you get when it comes to characters in a file across platforms... but a good lesson, I suppose.
As the commenter indicated, the bytes you're seeing are the byte order mark. See http://en.wikipedia.org/wiki/Byte_order_mark.
"ÿþ" is 0xFFFE, the UTF-16 "little endian" byte order mark. The "2" is your first actual character (for UTF-16, characters below 256 will be represented by bytes of the for 0xnn00;, where "nn" is the usual ASCII or UTF-8 code for that character, so something trying to read the bytes as ASCII or UTF-8 will do OK until it reaches the first null byte).
If you need to puzzle out the Unicode details of a text file the best tool I know of is the free SC Unipad editor (www.unipad.org). It is Windows-only but can read and write pretty much any encoding and will be able to tell you what there is to know about the file. It is very good at guessing the encoding.
Unipad will be able to open the file and let you save it in whatever encoding you want: ASCII, UTF-8, etc.

Converting ANSI to UTF8 with Ruby

I have a Ruby script that generates an ANSI file.
I want to convert the file to UTF8.
What's the easiest way to do it?
If your data is within the ASCII range 0 to 0x7F, it's already valid UTF-8, so you don't need to do anything.
Or, if there are characters above 0x7F, you could use Iconv:
require 'iconv'
# 'WINDOWS-1252' is a guess at what "ANSI" means here; use the file's actual code page
text = Iconv.conv('UTF-8', 'WINDOWS-1252', text)
The 8-bit Unicode Transformation Format (UTF-8) was designed to be backwards compatible with the American Standard Code for Information Interchange (ASCII). Therefore, by definition, any valid ASCII sequence is also a valid UTF-8 sequence. For more information, read the UTF FAQ and Unicode FAQ.
Any ASCII file is a valid UTF-8 file, so going by your question's title, no conversion is needed. (I don't know what a "UIF8" file is, going by your question's text, so the text appears to differ from the title.)
