Qt5 C++ UTF-8 convertion to Windows-1250 of Romanian ș and ț characters - c++11

My application is developed in C++'11 and uses Qt5. In this application, I need to store a UTF-8 text as Windows-1250 coded file.
I tried two following ways and both work expect for Romanian 'ș' and 'ț' characters :(
1.
auto data = QStringList() << ... <some texts here>;
QTextStream outStream(&destFile);
outStream.setCodec(QTextCodec::codecForName("Windows-1250"));
foreach (auto qstr, data)
{
outStream << qstr << EOL_CODE;
}
2.
auto data = QStringList() << ... <some texts here>;
auto *codec = QTextCodec::codecForName("Windows-1250");
foreach (auto qstr, data)
{
const QByteArray encodedString = codec->fromUnicode(qstr);
destFile.write(encodedString);
}
In case of 'ț' character (alias 0xC89B), instead of expected 0xFE value, the character is coded and stored as 0x3F, that it is unexpected.
So I am looking for any help or experience / examples regarding text recoding.
Best regards,

Do not confuse ț with ţ. The former is what is in your post, the latter is what's actually supported by Windows-1250.
The character ț from your post is T-comma, U+021B, LATIN SMALL LETTER T WITH COMMA BELOW, however:
This letter was not part of the early Unicode versions, which is why Ţ (T-cedilla, available from version 1.1.0, June 1993) is often used in digital texts in Romanian.
The character referred to is ţ, U+0163, LATIN SMALL LETTER T WITH CEDILLA (emphasis mine):
In early versions of Unicode, the Romanian letter Ț (T-comma) was considered a glyph variant of Ţ, and therefore was not present in the Unicode Standard. It is also not present in the Windows-1250 (Central Europe) code page.
The story of ş and ș, being S-cedilla and S-comma is analogous.
If you must encode to this archaic Windows 1250 code page, I'd suggest replacing the comma variants by the cedilla variants (both lowercase and uppercase) before encoding. I think Romanians will understand :)

Related

Ruby pack and Latin (high-ASCII) characters

An action outputs a fixed-length string via Ruby's pack function
clean = [edc_unico, sequenza_sede, cliente_id.to_s, nome, indirizzo, cap, comune, provincia, persona, note, telefono, email]
string = clean.pack('A15A5A6A40A35A5A30A2A40A40A18A25')
However, the data is in UTF-8 as to allow latin/high-ascii characters. The result of the pack action is logical. high-ascii characters take the space of 2 regular ascii characters. The resulting string is shortened by 1 space character, defeating the original purpose.
What would be a concise ruby command to interpret high-ascii characters and thus add an extra space at the end of each variable for each high-ascii character, so that the length can be brought to its proper target? (note: I am assuming there is no directive that addresses this specifically, and the whole lot of pack directives is mind-muddling)
update an example where the second line shifts positions based on accented characters
CNFrigo 539 Via Privata Da Via Iseo 6C 20098San Giuliano Milanese MI02 98282410 02 98287686 12886480156 12886480156 Bo3 Euro Giuseppe Frigo Transport 349 2803433 M.Gianoli#Delanchy.Fr S.Galliard#Delanchy.Fr
CNIn's M 497 Via Istituto S.Maria della Pietà, 30173Venezia Ve041 8690111 340 6311408 0041 5136113 00115180283 02896940273 B60Fm Euro Per Documentazioni Tecniche Inviare Materiale A : Silvia_Scarpa#Insmercato.It Amministrazione : Michela_Bianco#Insmercato.It Silvia Scarpa Per Liberatorie 041/5136171 Sig.Ra Bianco Per Pagamento Fatture 041/5136111 (Solo Il Giovedi Pomeriggio Dalle 14 All Beniservizi.Insmercato#Pec.Gruppopam.It
It looks like you are trying to use pack to format strings to fixed width columns for display. That’s not what it’s for, it is generally used for packing data into fixed byte structures for things like network protocols.
You probably want to use a format string instead, which is better suited for manipulating data for display.
Have a look at String#% (i.e. the % method on string). Like pack it uses another little language which is defined in Kernel#sprintf.
Taking a simplified example, with the two arrays:
plain = ["Iseo", "Next field"]
accent = ["Pietà", "Next field"]
then using pack like this:
puts plain.pack("A10A10")
puts accent.pack("A10A10")
will produce a result that looks like this, where “Next field” isn’t aligned since pack is dealing with the width in bytes, not the displayed width:
Iseo Next field
Pietà Next field
Using a format string, like this:
puts "%-10s%-10s" % plain
puts "%-10s%-10s" % accent
produces the desired result, since it is dealing with the displayable width:
Iseo Next field
Pietà Next field

Compare unicode std::string with usual "" literal or u8"" declartion

On Windows with Visual Studio 2015
// Ü
// UTF-8 (hex) 0xC3 0x9C
// UTF-16 (hex) 0x00DC
// UTF-32 (hex) 0x000000DC
using namespace std::string_literals;
const auto narrow_multibyte_string_s = "\u00dc"s;
const auto wide_string_s = L"\u00dc"s;
const auto utf8_encoded_string_s = u8"\u00dc"s;
const auto utf16_encoded_string_s = u"\u00dc"s;
const auto utf32_encoded_string_s = U"\u00dc"s;
assert(utf8_encoded_string_s == "\xC3\x9C");
assert(narrow_multibyte_string_s == "Ü");
assert(utf8_encoded_string_s == u8"Ü");
// here is the question
assert(utf8_encoded_string_s != narrow_multibyte_string_s);
"\u00dc"s is not the same as u8"\u00dc"s or "Ü"s is not the same as u8"Ü"s
Apparently the default encoding for usual string literal is not UTF-8 (Probably UTF-16) and I cannot just compare two std::string without knowing its encoding even they have the same semantic.
What is the practice to perform such string comparison in unicode-enable c++ application development??
For example an API like this:
class MyDatabase
{
bool isAvailable(const std::string& key)
{
// *compare* key in database
if (key == "Ü")
return true;
else
return false;
}
}
Other programs may call isAvailable with std::string in UTF-8 or default (UTF-16?) encoding. How can I garantee to do the proper comparision?
can I detect any encoding mismatch in compile-time?
Note: I prefer C++11/14 stuff.
Prefer std::string than std::wstring
"\u00dc" is a char[] encoded in whatever the compiler/OS's default 8-bit encoding happens to be, so it can be different on different machines. On Windows, that tends to be the OS's default Ansi encoding, or it could be the encoding that the source file is saved as.
L"\u00dc" is a wchar_t[] encoded with either UTF-16 or UTF-32, depending on the compiler's definition of wchar_t (which is 16-bit on Windows, so UTF-16).
u8"\u00dc" is a char[] encoded in UTF-8.
u"\u00dc" is a char16_t[] encoded in UTF-16.
U"\u00dc" is a char32_t[] encoded in UTF-32.
The ""s suffix simply returns a std::string, std::wstring, std::u16string, or std::u32string, depending on whether a char[], wchar_t[], char16_t[], or char32_t[] is passed to it.
When comparing two strings, make sure they are in the same encoding first. This is especially important for your char[]/std::string data, as it could be in any number of 8-bit encodings, depending on the systems involved. This is not so much a problem if the app is generating the strings itself, but it is important if one or more of the strings is coming from an external source (file, user input, network protocol, etc).
In your example, "\u00dc" and "Ü" are not necessarily guaranteed to produce the same char[] sequence, depending on how the compiler interprets those different literals. But even if they did (which seems to be the case in your example), neither of them will likely produce UTF-8 (you have to go to extra measures to force that), which is why your comparison to utf8_encoded_string_s fails.
So, if you are expecting a string literal to be UTF-8, use u8"" to ensure that. If you are getting string data from an external source and need it to be in UTF-8, convert it to UTF-8 in code as soon as possible, if it is not already (which means you have to know the encoding used by the external source).

How to convert utf-8 encoded string to Turkish characters in Xcode?

I have a webservis in php and I encoded the string in utf-8 like this :
$str_output = mb_convert_encoding("MATEMATİK", "UTF-8");
$data_array = array('name' => $str_output);
echo json_encode($data_array);
I get this string from webservis in xcode : MATEMAT\u00ddK
I couldn't convert this string to Turkish string.
My json_dictionary is like this
2014-01-08 16:17:22.274 test_app[6432:70b] {
name = "MATEMAT\U00ddK";
}
I tried this encoding method, but it didn't work for me
NSString * name = [json_dictionary objectForKey:#"name"];
NSString * correctString = [NSString stringWithCString:[baslik cStringUsingEncoding:NSUTF8StringEncoding] encoding:NSWindowsCP1254StringEncoding];
I got null
If I use NSUTF8StringEncoding
MATEMATÝK
Also I tried NSISOLatin1StringEncoding, NSISOLatin2StringEncoding ...
Thanks...
iOS is correctly decoding the \u00dd when you use NSUTF8StringEncoding (which is what you should be using). That's LATIN CAPITAL LETTER Y WITH ACUTE. The letter you want is LATIN CAPITAL LETTER I WITH DOT ABOVE, which is \u0130.
That suggests the problem is on your php side. If I had to guess, I'd suspect that the İ in your source file is not itself in the encoding that php expects. You may need to pass to "from" encoding to mb_convert_encoding depending on what encoding your editor is using.
I would strongly recommend that you stay in UTF-8 entirely if possible, and avoid creating a CP1254 (Turkish) string at all. UTF-8 is capable of encoding all the characters you need. In that case, you may be able to avoid the mb_convert_encoding entirely.

Converting Characters to ASCII Code & Vice Versa In C++/CLI

I am currently learning c++/cli and I want to convert a character to its ASCII code decimal and vice versa( example 'A' = 65 ).
In JAVA, this can be achieved by a simple type casting:
char ascci = 'A';
char retrieveASCII =' ';
int decimalValue;
decimalValue = (int)ascci;
retrieveASCII = (char)decimalValue;
Apparently this method does not work in c++/cli, here is my code:
String^ words = "ABCDEFG";
String^ getChars;
String^ retrieveASCII;
int decimalValue;
getChars = words->Substring(0, 1);
decimalValue = Int32:: Parse(getChars);
retrieveASCII = decimalValue.ToString();
I am getting this error:
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
Additional information: Input string was not in a correct format.
Any Idea on how to solve this problem?
Characters in a TextBox::Text property are in a System::String type. Therefore, they are Unicode characters. By design, the Unicode character set includes all of the ASCII characters. So, if the string only has those characters, you can convert to an ASCII encoding without losing any of them. Otherwise, you'd have to have a strategy of omitting or substituting characters or throwing an exception.
The ASCII character set has one encoding in current use. It represents all of its characters in one byte each.
// using ::System::Text;
const auto asciiBytes = Encoding::ASCII->GetBytes(words->Substring(0,1));
const auto decimalValue = asciiBytes[0]; // the length is 1 as explained above
const auto retrieveASCII = Encoding::ASCII->GetString(asciiBytes);
Decimal is, of course, a representation of a number. I don't see where you are using decimal except in your explanation. If you did want to use it in code, it could be like this:
const auto explanation = "The encoding (in decimal) "
+ "for the first character in ASCII is "
+ decimalValue;
Note the use of auto. I have omitted the types of the variables because the compiler can figure them out. It allows the code to be more focused on concepts rather than boilerplate. Also, I used const because I don't believe the value of "variables" should be varied. Neither of these is required.
BTW- All of this applies to Java, too. If your Java code works, it is just out of coincidence. If it had been written properly, it would have been easy to translate to .NET. Java's String and Charset classes have very similar functionality as .NET String and Encoding classes. (Encoding to the proper term, though.) They both use the Unicode character set and UTF-16 encoding for strings.
More like Java than you think
String^ words = "ABCDEFG";
Char first = words [0];
String^ retrieveASCII;
int decimalValue = ( int)first;
retrieveASCII = decimalValue.ToString();

How to remove these kind of symbols (junk) from string?

Imagine I have String in C#: "I Don’t see ya.."
I want to remove (replace to nothing or etc.) these "’" symbols.
How do I do this?
That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.
’ is the sequence C3 A2, E2 82 AC, E2 84 A2.
UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™
We then do it again: in Windows 1252 this sequence is E2 80 99, so the character should have been U+2019, RIGHT SINGLE QUOTATION MARK (’)
You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places that UTF-8 data was incorrectly interpreted as Windows-1252.
"I Don’t see ya..".Replace( "’", string.Empty);
How did that junk get in there the first place? That's the real question.
By removing any non-latin character you'll be intentionally breaking some internationalization support.
Don't forget the poor guy who's name has a "â" in it.
This looks disturbingly familiar to a character encoding issue dealing with the Windows character set being stored in a database using the standard character encoding. I see someone voted Will down, but he has a point. You may be solving the immediate issue, but the combinations of characters are limitless if this is the issue.
If you really have to do this, regular expressions are probably the best solution.
I would strongly recommend that you think about why you have to do this, though - at least some of the characters your listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.
Consider Regex.Replace(your_string, regex, "") - that's what I use.
Test each character in turn to see if it is a valid alphabetic or numeric character and if not then remove it from the string. The character test is very simple, just use...
char.IsLetterOrDigit;
Please there are various others such as...
char.IsSymbol;
char.IsControl;
Regex.Replace("The string", "[^a-zA-Z ]","");
That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.
[Edited: forgot the space in the regex]
The ASCII / Integer code for these characters would be out of the normal alphabetic Ranges. Seek and replace with empty characters. String has a Replace method I believe.
Either use a blacklist of stuff you do not want, or preferably a white list (set). With a white list you iterate over the string and only copy the letters that are in your white list to the result string. You said remove, and the way you do that is having two pointers one you read from (R) and one you write to (W):
I Donââ‚
W R
if comma is in your whitelist then you would in this case read the comma and write it where à is then advance both pointers. UTF-8 is a multi-byte encoding, so you advancing the pointer may not just be adding to the address.
With C an easy to way to get a white list by using one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you send up with a white list function instead of a set of course.
Usually when I see data like you have I look for memory corruption, or evidence to suggest that the encoding I expect is different than the one the data was entered with.
/Allan
I had the same problem with extraneous junk thrown in by adobe in an EXIF dump. I spent an hour looking for a straight answer and trying numerous half-baked suggestions which did not work here.
This thread more than most I have read was replete with deep, probing questions like 'how did it get there?', 'what if somebody has this character in their name?', 'are you sure you want to break internationalization?'.
There were some impressive displays of erudition positing how this junk could have gotten here and explaining the evolution of the various character encoding schemes. The person wanted to know how to remove it, not how it came to be or what the standards orgs are up to, interesting as this trivia may be.
I wrote a tiny program which gave me the right answer. Instead of paraphrasing the main concept, here is the entire, self-contained, working (at least on my system) program and the output I used to nuke the junk:
#!/usr/local/bin/perl -w
# This runs in a dos window and shows the char, integer and hex values
# for the weird chars. Install the HEX values in the REGEXP below until
# the final test line looks normal.
$str = 's: “Brian'; # Nuke the 3 werid chars in front of Brian.
#str = split(//, $str);
printf("len str '$str' = %d, scalar \#str = %d\n",
length $str, scalar #str);
$ii = -1;
foreach $c (#str) {
$ii++;
printf("$ii) char '$c', ord=%03d, hex='%s'\n",
ord($c), unpack("H*", $c));
}
# Take the hex characters shown above, plug them into the below regexp
# until the junk disappears!
($s2 = $str) =~ s/[\xE2\x80\x9C]//g; # << Insert HEX values HERE
print("S2=>$s2<\n"); # Final test
Result:
M:\new\6s-2014.1031-nef.halloween>nuke_junk.pl
len str 's: GÇ£Brian' = 11, scalar #str = 11
0) char 's', ord=115, hex='73'
1) char ':', ord=058, hex='3a'
2) char ' ', ord=032, hex='20'
3) char 'G', ord=226, hex='e2'
4) char 'Ç', ord=128, hex='80'
5) char '£', ord=156, hex='9c'
6) char 'B', ord=066, hex='42'
7) char 'r', ord=114, hex='72'
8) char 'i', ord=105, hex='69'
9) char 'a', ord=097, hex='61'
10) char 'n', ord=110, hex='6e'
S2=>s: Brian<
It's NORMAL!!!
One other actionable, working suggestion I ran across:
iconv -c -t ASCII < 6s-2014.1031-238246.halloween.exf.dif > exf.ascii.dif
If String having the any Junk date , This is good to way remove those junk date
string InputString = "This is grate kingdom¢Ã‚¬â";
string replace = "’";
string OutputString= Regex.Replace(InputString, replace, "");
//OutputString having the following result
It's working good to me , thanks for looking this review.

Resources