I want to show Shift-jis characters but only when displaying it. Store in UTF-8 and show in Shift-jis, so what is the solution to do that in Smarty?
You cannot mix different charsets/encodings in the output to the browser. So you can either send UTF-8 OR Shift-jis.
You can use UTF-8 internally and in an outputfilter convert the complete output from UTF-8 to Shift-jis (using mb_convert_encoding).
Smarty is not (really) equipped to deal with charsets other than ASCII supersets (like Latin1, UTF-8) internally.
Related
We have a content rendering site that displays content in multiple languages. The site is built using JSP and content fetched from Oracle DB. All our pages are UTF-8 compliant.
When displaying zh/jp content, out of the complete content, only some of the character appear garbled (square boxes on IE and diamond question marks on FF). The data in DB does not have any garbled character. Since we dont understand the language, we dont know what characters are problematic. Would appreciate some pointers to the solution please. Could it be that some characters may be appearing invalid to the browsers?
Example in FF:
ネット犯���者 がアプ
脆弱性保護機能 - ネット犯���者 がアプリケーションのセキュリティホール (脆弱性) を突いて、パソコンに脅威を侵入させることを阻止します。
In doubt on the sanity of a UTF-8 encoding, you can always re-encode it either with a good text editor or with a specialized tool like iconv :
iconv -f UTF-8 -t UTF-8 yourfile > youfile2
If your file is indeed invalid, iconv will also give you some information on the problem.
But, another way you might want to explore is installing new fonts for Far East languages…
Indeed, not knowing the actual bytes used in your file, it is hard to say why their are replaced with a replacement character � (U+FFFD). Thus, you might want to post a hex dump of the parts of your file that do not work.
I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. when in Ruby, I open the file, encode the string to utf8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that most of the printable ASCII characters are the same in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" characters. These special characters are represented using multiple bytes in UTF-8.
It's well possible that your conversion is actually working, but you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.
I would like to write a Ruby script which writes Japanese characters to the console. For example:
puts "こんにちは・今日は"
However, I get an exception when running it:
jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression
Is it possible to do? I'm using Ruby 1.8.6.
You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):
puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.
So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:
puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
My application has to write data to an XML file which will be read by a swf file. The swf expects the data in the XML to be in UTF-8 encoding. I have to convert some Multibyte characters in my app(Chinese simplified, Japanese, Korean etc..) to UTF-8.
Are there any API calls which could allow me to do this?I would prefer not to use any 3rd party dlls. I need to do it both on Windows and on Mac and would prefer any system API's if available.
Thanks
jbsp72
UTF-8 is a multibyte encoding (Well, a variable byte-length encoding to be precise). Stating that you need to convert from a multibyte encoding is not enough. You need to specify which multibye encoding your source is?
I have to convert some Multibyte
characters in my app(Chinese
simplified, Japanese, Korean etc..) to
UTF-8.
if your original string is in multibyte (chinese/arabic/thai/etc..) and you need to convert it to other multibyte (UTF-8), One way is to convert to WideCharacter(UTF-16) first, then convert back to multibyte.
multibyte(chinese/arabic/thai/etc) -> widechar(UTF-16) -> multibyte(UTF-8)
if your original string is already in Unicode(UTF-16), you can skip the first conversion in the above illustration
you can refer the codepage from MSDN.
Google Chrome has some string conversion implementations for Windows, Linux, and Mac. You can see it here or here. the files are under src/base:
+ sys_string_conversions.h
+ sys_string_conversions_linux.cc
+ sys_string_conversions_win.cc
+ sys_string_conversions_mac.mm
The code uses BSD license so you can use it for commercial projects.
Is there a way to identify whether the browser encoding is set to/supports "UTF-8" from Javascript?
I want to send "UTF-8" or "English" letters based on browser setting transparently (i.e. without asking the User)
Edit: Sorry I was not very clear on the question. In a Browser the encoding is normally specified as Auto-Detect (or) Western (Windows/ISO-9959-1) (or) Unicode (UTF-8). If the user has set the default to Western then the characters I send are not readable. In this situation I want to inform the user to either set the encoding to "Auto Detect" (or) "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
I assume the length of the output (that you read back after outputting it) can tell you what happened (or, without JavaScript, use the Accept-Charset HTTP header, and assume the UTF-8 encoding is supported when Unicode is accepted).
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.