I have a web service in PHP, and I encoded the string as UTF-8 like this:
$str_output = mb_convert_encoding("MATEMATİK", "UTF-8");
$data_array = array('name' => $str_output);
echo json_encode($data_array);
I get this string from the web service in Xcode: MATEMAT\u00ddK
I couldn't convert this string to the correct Turkish string.
My json_dictionary looks like this:
2014-01-08 16:17:22.274 test_app[6432:70b] {
name = "MATEMAT\U00ddK";
}
I tried this encoding method, but it didn't work for me:
NSString * name = [json_dictionary objectForKey:@"name"];
NSString * correctString = [NSString stringWithCString:[name cStringUsingEncoding:NSUTF8StringEncoding] encoding:NSWindowsCP1254StringEncoding];
I got null. If I use NSUTF8StringEncoding, I get:
MATEMATÝK
I also tried NSISOLatin1StringEncoding, NSISOLatin2StringEncoding, and so on.
Thanks...
iOS is correctly decoding the \u00dd when you use NSUTF8StringEncoding (which is what you should be using). That's LATIN CAPITAL LETTER Y WITH ACUTE. The letter you want is LATIN CAPITAL LETTER I WITH DOT ABOVE, which is \u0130.
That suggests the problem is on your PHP side. If I had to guess, I'd suspect that the İ in your source file is not itself in the encoding that PHP expects. You may need to pass the "from" encoding to mb_convert_encoding, depending on what encoding your editor is using.
I would strongly recommend that you stay in UTF-8 entirely if possible, and avoid creating a CP1254 (Turkish) string at all. UTF-8 is capable of encoding all the characters you need. In that case, you may be able to avoid the mb_convert_encoding entirely.
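For what it's worth, the observed Ý is a telltale byte collision: İ (U+0130) is byte 0xDD in the Turkish Windows-1254 code page, and 0xDD read as Latin-1 is Ý (U+00DD). A quick sketch of that collision, shown here in Ruby purely for illustration:

```ruby
# İ (U+0130) occupies byte 0xDD in the Turkish Windows-1254 code page.
turkish = "MATEMATİK".encode("windows-1254")
p turkish.bytes[7]   # => 221 (0xDD), the İ byte

# Reading those same bytes as Latin-1 turns 0xDD into Ý (U+00DD),
# which is exactly the MATEMAT\u00ddK mojibake from the question.
mojibake = turkish.force_encoding("iso-8859-1").encode("utf-8")
p mojibake           # => "MATEMATÝK"
```

So somewhere in the PHP pipeline the text is being produced as Windows-1254 bytes but labeled as Latin-1/UTF-8, which is consistent with the source-file-encoding suspicion above.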
My application is developed in C++11 and uses Qt5. In this application, I need to store UTF-8 text as a Windows-1250-encoded file.
I tried the two following ways, and both work except for the Romanian 'ș' and 'ț' characters :(
1.
auto data = QStringList() << ... <some texts here>;
QTextStream outStream(&destFile);
outStream.setCodec(QTextCodec::codecForName("Windows-1250"));
foreach (auto qstr, data)
{
outStream << qstr << EOL_CODE;
}
2.
auto data = QStringList() << ... <some texts here>;
auto *codec = QTextCodec::codecForName("Windows-1250");
foreach (auto qstr, data)
{
const QByteArray encodedString = codec->fromUnicode(qstr);
destFile.write(encodedString);
}
In the case of the 'ț' character (UTF-8 bytes 0xC8 0x9B), instead of the expected 0xFE value, the character is encoded and stored as 0x3F (the '?' substitution character), which is unexpected.
So I am looking for any help or experience / examples regarding text recoding.
Best regards,
Do not confuse ț with ţ. The former is what is in your post, the latter is what's actually supported by Windows-1250.
The character ț from your post is T-comma, U+021B, LATIN SMALL LETTER T WITH COMMA BELOW, however:
This letter was not part of the early Unicode versions, which is why Ţ (T-cedilla, available from version 1.1.0, June 1993) is often used in digital texts in Romanian.
The character referred to is ţ, U+0163, LATIN SMALL LETTER T WITH CEDILLA (emphasis mine):
In early versions of Unicode, the Romanian letter Ț (T-comma) was considered a glyph variant of Ţ, and therefore was not present in the Unicode Standard. It is also not present in the Windows-1250 (Central Europe) code page.
The story of ş and ș (S-cedilla and S-comma) is analogous.
If you must encode to this archaic Windows 1250 code page, I'd suggest replacing the comma variants by the cedilla variants (both lowercase and uppercase) before encoding. I think Romanians will understand :)
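That substitution can be a simple character map applied before encoding. A minimal sketch in Ruby (for illustration only; the same idea maps directly onto QString::replace in the Qt code from the question):

```ruby
# Map the Romanian comma-below letters (absent from Windows-1250) to their
# cedilla counterparts (present in Windows-1250), in both cases.
text  = "țară"                     # contains T-comma (U+021B)
fixed = text.tr("șțȘȚ", "şţŞŢ")    # now T-cedilla (U+0163)

# The cedilla variant encodes to the expected 0xFE byte instead of '?'.
p fixed.encode("windows-1250").bytes.first   # => 254 (0xFE)
```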
I have a good string and a bad string
to handle a bad string I do
bad.encode("iso-8859-1").force_encoding("utf-8")
which makes it readable
if I do
good.encode("iso-8859-1").force_encoding("utf-8")
I get Encoding::UndefinedConversionError: U+05E2 from UTF-8 to ISO-8859-1
Both the good and the bad strings are UTF-8 to begin with, but the good strings are readable and the bad ones are, well, bad.
I don't know how to detect whether a string is good or not, and I am trying to find a way to process a string and make it readable in the correct encoding,
something like this:
if needs_fixin?(str)
  str.encode("iso-8859-1").force_encoding("utf-8")
else
  str
end
The only thing I can think of is to catch the exception and skip the encoding-fixing part, but I don't want the code to raise exceptions intentionally.
something like str.try(:encode, "iso-8859-1").force_encoding("utf-8") rescue str
bad string is something like
×¢×××× ×¢×¥ ×'××¤×¡× ×פת×ר ×× ××רק××
I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.
This produces a double-encoded string with UTF-8 characters:
> str = "汉语 / 漢語"
=> "汉语 / 漢語"
> str.force_encoding("iso-8859-1")
=> "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.force_encoding("iso-8859-1").encode("utf-8")
=> "æ±\u0089è¯ / æ¼¢èª\u009E"
You can then fix it by reinterpreting the double-encoded UTF-8 as ISO-8859-1 and then declaring the encoding to actually be UTF-8
> bad.encode("iso-8859-1").force_encoding("utf-8")
=> "汉语 / 漢語"
But you can't convert the actual UTF-8 string into ISO-8859-1, since there are codepoints in UTF-8 which ISO-8859-1 doesn't have any unambiguous means of encoding
> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1
Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."
So, the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertable bytes:
> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
=> " / "
The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.
> str_bom = "\xEF\xBB\xBF" + str
=> "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
=> true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
=> false
If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking if the BOM is present. If it's not (ie, it's been re-encoded) then you can perform your decoding routine:
> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
=> true
If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, by counting unprintable characters, or characters which fall outside of your normal expected result set (your string looks like it's dealing with Hebrew; you could say that any string which consists of >50% non-Hebrew letters is double-encoded, for example), so you could then attempt to decode it.
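Such a heuristic might look like the sketch below (the 50% threshold is an arbitrary choice, and needs_fixin? is just the name from the question, not a standard API):

```ruby
# Heuristic: assume a string is double-encoded when fewer than half of its
# letter characters are Hebrew. The 0.5 threshold is an arbitrary choice.
def needs_fixin?(str)
  letters = str.scan(/\p{L}/)
  return false if letters.empty?
  hebrew = letters.count { |c| c.match?(/\p{Hebrew}/) }
  hebrew.fdiv(letters.size) < 0.5
end

good = "עברית"
bad  = good.dup.force_encoding("windows-1252").encode("utf-8")  # simulate double-encoding

needs_fixin?(good)  # => false
needs_fixin?(bad)   # => true
```

Double-encoded Hebrew turns into Latin-1/Windows-1252 punctuation and symbols, so almost none of its characters match \p{Hebrew}, which is what the heuristic keys on.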
Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:
str = "汉语 / 漢語"
begin
  str.encode("iso-8859-1").encode("utf-8")
rescue Encoding::UndefinedConversionError
  str
end
However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:
> bad_str = str.force_encoding("windows-1252").encode("utf-8")
=> "æ±‰è¯ / 漢語"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1
Since the string itself doesn't carry any information about the encoding it was incorrectly encoded from, you don't have enough information to reliably solve it, and are left with iterating through a list of most-likely encodings and heuristically checking the result of each successful re-encode with your Hebrew heuristic.
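That iteration could be sketched like this (the candidate list and the accept-if-it-contains-Hebrew test are assumptions for this particular use case, not a general recipe):

```ruby
# Candidate encodings the string might have been mis-declared as before it
# was converted to UTF-8 (an assumption for this Hebrew-text use case).
CANDIDATES = %w[iso-8859-1 windows-1252 windows-1255]

def undouble(str)
  CANDIDATES.each do |enc|
    begin
      fixed = str.encode(enc).force_encoding("utf-8")
      # Accept the first candidate that yields valid UTF-8 containing Hebrew.
      return fixed if fixed.valid_encoding? && fixed.match?(/\p{Hebrew}/)
    rescue Encoding::UndefinedConversionError
      # This candidate can't represent some codepoint; try the next one.
    end
  end
  str  # nothing worked: hand the input back unchanged
end

bad = "עברית".dup.force_encoding("windows-1252").encode("utf-8")
undouble(bad)  # => "עברית"
```

Note that a correctly encoded Hebrew string falls through all candidates (its Windows-1255 reinterpretation is not valid UTF-8) and is returned unchanged, which is exactly the behavior the question's needs_fixin? branch was after.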
To echo the post I linked: character encodings are hard.
What is the idiomatic way to remove non-ASCII characters from file contents in D?
I tried:
auto s = (cast(string) std.file.read(myPath)).filter!( a => a < 128 ).array;
which gave me:
std.utf.UTFException#C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1109): Invalid UTF-8 sequence (at index 1)
(and s ends up a dstring); and:
auto s = (cast(string) std.file.read(myPath)).tr("\0-~", "", "cd");
which gives me:
core.exception.UnicodeException#src\rt\util\utf.d(290): invalid UTF-8 sequence
at runtime.
I am trying to parse (with the almost-deprecated std.xml module) XML files in an unsupported encoding, but I am OK with removing the offending characters.
If you do anything that treats it as a string, D tries to interpret it as UTF-8. Instead, treat it as a series of bytes: replace your cast(string) with cast(ubyte[]) and then do the filter.
After reading and filtering it, you can then cast it back into a string. So this should do what you need:
auto s = cast(string) ((cast(ubyte[]) std.file.read(myPath)).filter!( a => a < 128 ).array);
I have a ISO-2022-JP-2 string and need to convert it to UTF-8, but I am getting an error.
To be more concrete: I am trying to read an email which was transferred as quoted-printable. This email contains the word tōtatsu (notice the macron above the o), and I am converting the given text like this:
given = "t=1B$(D+W=1B(Btatsu"
text = given.unpack("M*").first #convert from quoted-printable
Basically this will replace =1B with the proper \e escape character and the string in text becomes t␛$(D+W␛(Btatsu.
Wikipedia says that ␛$(D is used to switch to JIS X 0212-1990 and likewise ␛(B is used to switch back to ASCII. Notice that ␛$(D is new in ISO-2022-JP-2, it is not part of the original ISO-2022-JP.
However, the encoding of the string is still ASCII-8BIT, so I guess I have to force the proper encoding, since Ruby has no way of knowing that the actual string is ISO-2022-JP-2?
puts text.encoding # ASCII-8BIT
text = text.force_encoding('iso-2022-jp-2')
Now it turns out that
text.encode('utf-8')
is not able to convert the given string: code converter not found (ISO-2022-JP-2 to UTF-8) (Encoding::ConverterNotFoundError)
How can I convert this string to UTF-8?
It seems like Ruby 2.1 does not support iso-2022-jp-2 encoding:
>> "t\e$(D+W\e(Btatsu".encode('utf-8', 'iso-8859-1')
=> "t\e$(D+W\e(Btatsu"
>> "t\e$(D+W\e(Btatsu".encode('utf-8', 'iso-2022-jp-2')
Encoding::ConverterNotFoundError: code converter not found (ISO-2022-JP-2 to UTF-8)
from (irb):1:in `encode'
from (irb):1
from /home/falsetru/.rvm/rubies/ruby-2.1.2/bin/irb:11:in `<main>'
You can use iconv instead:
require 'iconv'
Iconv.conv('utf-8', 'iso-2022-jp-2', "t\e$(D+W\e(Btatsu")
# => "tōtatsu"
I mount an SMB path using this code:
urlStringOfVolumeToMount = [urlStringOfVolumeToMount stringByAddingPercentEscapesUsingEncoding:NSMacOSRomanStringEncoding];
NSURL *urlOfVolumeToMount = [NSURL URLWithString:urlStringOfVolumeToMount];
FSVolumeRefNum returnRefNum;
FSMountServerVolumeSync( (CFURLRef)urlOfVolumeToMount, NULL, NULL, NULL, &returnRefNum, 0L);
Then, I get the contents of some paths:
NSMutableArray *content = (NSMutableArray *)[[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:&error];
My problem is that every path in the "content" array containing special chars (ü for example) gives me 2 encoded chars: ü becomes u¨.
When I log the bytes using:
[contentItem dataUsingEncoding:NSUTF8StringEncoding];
it gives me 75cc88, which is u (75) and ¨ (cc88).
What I expected is the ü char encoded in UTF-8, which in bytes should be c3bc.
I've tried converting my path using ISOLatin1 encoding, MacOSRoman... but as long as the content path already has 2 separate chars instead of one for ü, any conversion gives me 2 encoded chars...
If someone can help, thanks.
My configuration: localized in French and using Snow Leopard.
urlStringOfVolumeToMount = [urlStringOfVolumeToMount stringByAddingPercentEscapesUsingEncoding:NSMacOSRomanStringEncoding];
Unless you specifically need MacRoman for some reason, you should probably be using UTF-8 here.
NSMutableArray *content = (NSMutableArray *)[[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:&error];
My problem is every path in "content" array containing special chars (ü for example) give me 2 chars encoded : ü becomes u¨
You're expecting composed characters and getting decomposed sequences.
Since you're getting the pathnames from the file-system, this is not a problem: The pathnames are correct as you're receiving them, and as long as you pass them to something that does Unicode right, they will display correctly as well.
Well, four years later I'm struggling with the same thing, but with åäö in my case.
It took a lot of time to find the simple solution.
NSString has the necessary comparator built in.
Comparing aString with anotherString, where one comes from the array returned by NSFileManager's contentsOfDirectoryAtPath:, is as simple as:
if( [aString compare:anotherString] == NSOrderedSame )
The compare method takes care of putting both strings into a comparable canonical form, in effect making them "if they look the same, they are the same".
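The composed-versus-decomposed issue behind all of this can be illustrated outside Objective-C too; here is a small Ruby analogue (added for illustration, not the Objective-C API): HFS+ returns filenames in decomposed form, and normalizing both sides, which -[NSString compare:] effectively does by default, makes them compare equal.

```ruby
composed   = "ü"        # U+00FC, a single precomposed codepoint (NFC)
decomposed = "u\u0308"  # "u" + U+0308 COMBINING DIAERESIS (NFD), as HFS+ returns it

composed == decomposed                     # => false: different bytes
composed.bytes.map { |b| b.to_s(16) }      # => ["c3", "bc"]   (the expected c3bc)
decomposed.bytes.map { |b| b.to_s(16) }    # => ["75", "cc", "88"] (the observed 75cc88)

# Normalizing both sides to the same form makes them compare equal.
composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # => true
```

Those byte sequences are exactly the c3bc-versus-75cc88 pair from the original question, so the filenames were never wrong, just decomposed.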