Is there any way to use encoding:NSUTF8StringEncoding in this code?
NSString *aa = [[theNews objectAtIndex:indexPath.section] objectForKey:@"Title"];
cell.titleLabel.text = [aa encoding:NSUTF8StringEncoding];
Because I get this result:
"&#1578;&#1580;&#1575;&#1585;&#1576; &#1603;&#1578;&#1575;&#1576;&#1577; &#1576;&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;"
which should be these Arabic words:
"تجارب كتابة بالعربي"
"&#1578;" is not UTF-8 encoded; it is XML-encoded as a character entity. The UTF-8 version of that character (ت, U+062A) is 2 bytes long (0xD8 followed by 0xAA). Yours is a 7-character string.
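To see the difference concretely, here is a small sketch in plain C (the names `escaped` and `utf8_taa` are just for illustration): the entity form the server sends is seven ASCII characters, while the UTF-8 encoding of the actual letter is two bytes.

```c
#include <string.h>

/* The XML-escaped form: a 7-character ASCII string. */
static const char *escaped = "&#1578;";

/* The UTF-8 encoding of the actual character U+062A (ت): two bytes. */
static const unsigned char utf8_taa[] = { 0xD8, 0xAA };

size_t escaped_len(void) { return strlen(escaped); }   /* 7 */
size_t utf8_len(void)    { return sizeof utf8_taa; }   /* 2 */
```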
The tool you want is CFXML:
NSString *arabic = CFBridgingRelease(
    CFXMLCreateStringByUnescapingEntities(NULL,
                                          (__bridge CFStringRef)string,
                                          NULL));
Related
-[NSMutableAttributedString initWithHTML:documentAttributes:] seems to mangle special characters:
NSString *html = @"“Hello” World"; // notice the smart quotes
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:nil];
NSLog(@"%@", as);
That prints “Hello†World followed by some RTF commands. In my application, I convert the attributed string to RTF and display it in an NSTextView, but the characters are corrupted there, too.
According to the documentation, the default encoding is UTF-8, but I tried being explicit and the result is the same:
NSDictionary *attributes = @{NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]};
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:&attributes];
Use [html dataUsingEncoding:NSUnicodeStringEncoding] when creating the NSData and set the matching encoding option when you parse the HTML into an attributed string:
The documentation for NSCharacterEncodingDocumentAttribute is slightly confusing:
NSNumber, containing an int specifying the NSStringEncoding for the
file; for reading and writing plain text files and writing HTML;
default for plain text is the default encoding; default for HTML is
UTF-8.
So, your code should be:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                          NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:options
                                 documentAttributes:nil];
The previous answer here works, but mostly by accident.
Making an NSData with NSUnicodeStringEncoding will tend to work, because that constant is an alias for NSUTF16StringEncoding, and UTF-16 is pretty easy for the system to identify. Easier than UTF-8, which apparently was being identified as some other superset of ASCII (it looks like NSWindowsCP1252StringEncoding in your case, probably because it's one of the few ASCII-based encodings with mappings for 0x8_ and 0x9_).
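As a sketch of why that misidentification mangles smart quotes, consider the three UTF-8 bytes of the left quote “ (U+201C): each one maps to a different printable character in Windows-1252. This is plain C for illustration; the mapping values below are taken from the CP1252 code chart.

```c
/* UTF-8 encoding of the left smart quote “ (U+201C). */
static const unsigned char utf8_ldquo[] = { 0xE2, 0x80, 0x9C };

/* What each byte becomes when misread as Windows-1252. */
unsigned cp1252_to_unicode(unsigned char b) {
    switch (b) {
        case 0xE2: return 0x00E2; /* â */
        case 0x80: return 0x20AC; /* € */
        case 0x9C: return 0x0153; /* œ */
        default:   return b;      /* other bytes: Latin-1-style identity */
    }
}
```

Decoding those three bytes that way yields “â€œ”, which is exactly the kind of garbage shown in the question.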
That answer is mistaken in quoting the documentation for NSCharacterEncodingDocumentAttribute, because "attributes" are what you get out of -initWithHTML. That's why it's NSDictionary ** and not just NSDictionary *. You can pass in a pointer to an NSDictionary *, and you'll get out keys like TopMargin/BottomMargin/LeftMargin/RightMargin, PaperSize, DocumentType, UTI, etc. Any values you try to pass in through the "attributes" dictionary are ignored.
You need to use "options" for passing values in, and the relevant option key is NSTextEncodingNameDocumentOption, which has no documented default value. It's passing the bytes to WebKit for parsing, so if you don't specify an encoding, presumably you're getting WebKit's encoding-guessing heuristics.
To guarantee the encoding types match between your NSData and NSAttributedString, what you should do is something like:
NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:@{NSTextEncodingNameDocumentOption: @"UTF-8"}
                                 documentAttributes:nil];
A Swift version of the accepted answer:
let htmlString: String = "Hello world contains html</br>"
let data: Data = Data(htmlString.utf8)
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try? NSAttributedString(data: data,
                                               options: options,
                                               documentAttributes: nil)
I get NSStrings with characters like \u00f6. I can't find how to encode them to UTF-8.
NSString *resultString = [NSString stringWithContentsOfURL:wikiSearchURL usedEncoding:NSUTF8StringEncoding error:&err];
Thanks...
I think you want to do this:
NSString *resultString = [NSString stringWithContentsOfURL:wikiSearchURL encoding:NSUTF8StringEncoding error:&err];
the usedEncoding: one will tell you what encoding it used when it parsed the URL, while the encoding: one will force it to use a particular encoding.
NSString conceptually uses UTF-16 as its internal format. 0x00F6 is a perfectly valid character to find in an NSString; it's ö (o-umlaut). If you want to convert the string to UTF-8, use -UTF8String:
const char* foo = [myString UTF8String];
Note that your line of code which gets a string from a URL and tries to figure out which encoding was used is wrong. You should use something like:
NSStringEncoding theEncoding;
NSString *resultString = [NSString stringWithContentsOfURL: wikiSearchURL usedEncoding: &theEncoding error:&err];
Assuming the returned string is not nil, theEncoding will now contain the encoding that was used to convert the URL content to the string.
This is another crack at my MD5 problem. I know the issue is with the ASCII character © (0xa9, 169). Either it is the way I am inserting the character into the string, or it's a high-byte vs. low-byte problem.
If I
NSString *source = [NSString stringWithFormat:@"%c", 0xa9];
NSData *data = [source dataUsingEncoding:NSASCIIStringEncoding];
NSLog(@"\n\n ############### source %@ \ndata desc %@", source, [data description]);
CC_MD5([data bytes], [data length], result);
return [NSString stringWithFormat:
    @"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
    result[0], result[1], result[2], result[3],
    result[4], result[5], result[6], result[7],
    result[8], result[9], result[10], result[11],
    result[12], result[13], result[14], result[15]
];
Result:
######### source ©
[data description] = (null)
md5: d41d8cd98f00b204e9800998ecf8427e
values: int 169 char ©
When I change the encoding to
NSData *data = [NSData dataWithBytes:[source UTF8String] length:[source length]];
The result is
######### source ©
[data description] = <c2>
md5: 6465dad1d31752be3f3283e8f70feef7
When I change the encoding to
NSData *data = [NSData dataWithBytes:[source UTF8String] length:[source lengthOfBytesUsingEncoding:NSUTF8StringEncoding]];
The result is
############### source © len 2
[data description] = <c2a9>
md5: a541ecda3d4c67f1151cad5075633423
When I run the same function in Java I get
>>>>> msg## \251 \251
md5 a252c2c85a9e7756d5ba5da9949d57ed
The question is what is the best way to get the same byte in objC as I get in Java?
“ASCII to NSData” makes no sense, because ASCII is an encoding; if you have encoded characters, then you have data.
An encoding is a transformation of ideal Unicode characters (code points) into one-or-more-byte units (code units), possibly in sequences such as UTF-16's surrogate pairs.
An NSString is more or less an ideal Unicode object. It contains the characters of the string, in Unicode, irrespective of any encoding*.
ASCII is an encoding. UTF-8 is also an encoding. When you ask the string for its UTF8String, you are asking it to encode its characters as UTF-8.
NSData *data = [NSData dataWithBytes:[source UTF8String] length:[source length]];
The result is
######### source ©
[data description] = <c2>
That's because you passed the wrong length. The string's length (in characters) is not the same as the number of code units (bytes, in this case) in some encoding.
The correct length is strlen([source UTF8String]), but it's easier for you and faster at run time to use dataUsingEncoding: to ask the string to create the NSData object for you.
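A plain-C sketch of the mismatch (the variable name is illustrative): the copyright sign is one character, but its UTF-8 encoding is the two bytes 0xC2 0xA9, so passing the character count as the byte length truncates the data.

```c
#include <string.h>

/* UTF-8 bytes of "©" (U+00A9): 0xC2 0xA9. An NSString holding "©"
   reports -length == 1 (one UTF-16 code unit), but the encoded data
   is two bytes long. */
static const char *utf8_copyright = "\xC2\xA9";

size_t utf8_byte_count(void) { return strlen(utf8_copyright); } /* 2 */
```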
When I change the encoding to
NSData *data = [NSData dataWithBytes:[source UTF8String] length:[source lengthOfBytesUsingEncoding:NSUTF8StringEncoding]];
You didn't change the encoding. You're still encoding it as UTF-8.
Use dataUsingEncoding:.
The question is what is the best way to get the same byte in objC as I get in Java?
Use the same encoding.
There is no such thing as “extended ASCII”. There are several different encodings that are based on (or at least compatible with) ASCII, including ISO 8859-1, ISO 8859-9, MacRoman, Windows codepage 1252, and UTF-8. You need to decide which one you mean and tell the string to encode its characters with that.
Better yet, continue using UTF-8—it is almost always the right choice for mostly-ASCII text—and change your Java code instead.
NSData *data = [source dataUsingEncoding:NSASCIIStringEncoding];
Result:
[data description] = (null)
True ASCII can only encode 128 possible characters. Unicode includes all of ASCII unchanged, so the first 128 code points in Unicode are what ASCII can encode. Anything else, ASCII cannot encode.
I've seen NSASCIIStringEncoding behave as equivalent to NSISOLatin1StringEncoding before; it sounds like they might have changed it to be a pure ASCII encoding, and if that's the case, that's a good thing. There is no copyright symbol in ASCII. What you see here is the correct result.
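The boundary is easy to state in code; a one-line sketch in plain C (the function name is illustrative):

```c
#include <stdbool.h>

/* True ASCII covers only code points 0..127. 0xA9 (©) is outside that
   range, so a strict ASCII conversion must fail, which is why the
   NSData came back nil. */
bool ascii_can_encode(unsigned codepoint) {
    return codepoint <= 0x7F;
}
```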
*This is not quite true; the characters are exposed as UTF-16, so any characters outside the Basic Multilingual Plane are exposed as surrogate pairs, not whole characters as they would be in a truly ideal string object. This is a trade-off. In Swift, the built-in String type is a perfect ideal Unicode object; characters are characters, never divided until encoded. But when working with NSString (whether in Swift or in Objective-C), as far as you are concerned, you should treat it as an ideal string.
Thanks to GBegan's explanation in another post I was able to cobble this together.
for (int i = 0; i < [s length]; i++) {
    unichar character = [s characterAtIndex:i];
    unsigned char byte = (unsigned char)character;
    NSMutableData *oneByte = [NSMutableData dataWithBytes:&byte length:1];
}
I create an NSString using,
NSString *myString = [[NSString alloc] initWithBytes:someBuffer length:sizeof(someBuffer) encoding:NSASCIIStringEncoding];
I used NSLog to output myString and it displays "Hello".
If this is the case, then why does this fail.
NSString *helloString = @"Hello";
BOOL check = [myString isEqualToString:helloString];
Your myString variable is actually an NSString with a length of 64; the additional characters are probably undefined. What you most likely want to do is this:
NSString *myString = [[NSString alloc] initWithBytes:someBuffer length:strlen(someBuffer) encoding:NSASCIIStringEncoding];
This assumes a null-terminated C-string exists in your buffer.
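The sizeof-versus-strlen distinction can be sketched in plain C (buffer contents assumed to match the question):

```c
#include <string.h>

/* A 64-byte buffer holding the NUL-terminated string "Hello".
   sizeof gives the buffer's capacity; strlen gives the string's length. */
size_t buffer_capacity(void) {
    char someBuffer[64] = "Hello";
    return sizeof someBuffer;   /* 64: what the question's code passed */
}

size_t string_length(void) {
    char someBuffer[64] = "Hello";
    return strlen(someBuffer);  /* 5: what initWithBytes: actually needs */
}
```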
There are probably some trailing characters that you can't see when calling NSLog(). For example: whitespace, linefeeds or even '\0' characters.
Check [myString length] to see if it returns 5.
I have this line of code to convert NSString to NSData:
NSData *data = [NSData dataWithBytes:[message UTF8String] length:[message lengthOfBytesUsingEncoding:NSUTF8StringEncoding]];
How do I do this in Unicode instead of UTF8? My message may contain cyrillic characters or diacritical marks.
First off, you should use dataUsingEncoding: instead of going through UTF8String. You only use UTF8String when you need a C string in that encoding.
Then, for “Unicode” (specifically, UTF-16), just pass NSUnicodeStringEncoding instead of NSUTF8StringEncoding in your dataUsingEncoding: message.
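For reference, here is a sketch (plain C) of the bytes you can then expect: with NSUnicodeStringEncoding, dataUsingEncoding: typically prepends a byte-order mark and writes UTF-16 code units in host order, so on a little-endian machine "ö" (U+00F6) comes out as four bytes. The BOM behavior and endianness here are assumptions; check the actual data on your platform.

```c
#include <stddef.h>

/* Expected UTF-16 data for "ö" on a little-endian machine:
   FF FE = byte-order mark, then F6 00 = U+00F6 in UTF-16LE. */
static const unsigned char utf16le_o_umlaut[] = { 0xFF, 0xFE, 0xF6, 0x00 };

size_t utf16_data_length(void) { return sizeof utf16le_o_umlaut; } /* 4 */
```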