Umlauts in filenames are replaced by question marks (UTF-8)

On one of our ColdFusion 10 Enterprise / CentOS 6.5 servers, umlauts in filenames are saved as ?.
For example:
<CFPROCESSINGDIRECTIVE pageencoding="UTF-8">
<CFSET VARIABLES.umlauts = "ümläüté" />
<CFSET VARIABLES.filename = createUUID() & "-" & VARIABLES.umlauts & ".txt" />
<CFFILE action="write" output="#VARIABLES.umlauts#" file="#expandpath("./" & VARIABLES.filename)#" />
<CFOUTPUT>#VARIABLES.filename#</CFOUTPUT> <!--- outputs something like: A9C9BC8C-983A-5EA6-A4ED411BA0E63C72-ümläüté.txt --->
This writes a file called A8B49720-020A-2500-605F4CC73129D07C-?ml??t?.txt to disk. The content of the file is "ümläüté", as expected.
Creating files with umlauts in the filename manually is no problem (e.g. touch äöüß.txt works as expected).
More details about the server:
Java Version: 1.6.0_29
Tomcat Version: 7.0.23.0
Java File Encoding: UTF8
$ cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
$ locale
LANG=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
Any ideas what could cause this behaviour?

I'll post this as an answer for better visibility.
A user of Open Blue Dragon (an alternative CFML Engine) was having exactly the same issue.
If I try to upload a file with, for example, the filename "testätest.pdf", then I have the following situation:
The file, OpenBD stores to my filesystem, is named: test?test.pdf
The filename, reported via #cffile.ServerFile# is: testätest.pdf
He later came back with this answer:
It seems like this has been resolved by setting "LC_ALL=en_US.UTF-8". It seems to be a tomcat problem that it sets question marks for special characters if the charset is unknown.
In the OP's case, setting LC_ALL to "de_DE.UTF-8" may be the equivalent fix, since the locale output above shows LC_ALL is empty.
Source: Issue 516: Special characters (like german "Umlauts") in filenames of uploaded files are replaced with "?"
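The "?" substitution is what an encoder typically does when the target charset cannot represent a character. A minimal Python sketch of that effect, assuming the runtime fell back to an ASCII-only charset for filenames (the function name is hypothetical, and this only imitates the behaviour; it is not Tomcat or ColdFusion code):

```python
def encode_filename(name: str, charset: str) -> str:
    """Imitate a runtime that replaces unencodable characters with '?'."""
    # errors="replace" substitutes '?' for every character the charset lacks
    raw = name.encode(charset, errors="replace")
    return raw.decode(charset)


# With an ASCII-only charset every umlaut is lost
print(encode_filename("ümläüté.txt", "ascii"))  # ?ml??t?.txt

# With UTF-8 the name survives unchanged
print(encode_filename("ümläüté.txt", "utf-8"))
```

This would also explain why the file content is intact while the filename is not: the content is written with an explicit UTF-8 encoding, whereas the filename goes through whatever charset the process derives from its locale.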

Related

Handling file paths in AppleScript droplet

I created a droplet that is supposed to pass a file (dropped on the droplet) to a website in Safari through a file dialog.
I managed most of it, but the file path I get through "theDroppedItems" starts with file:// and also includes sequences like %20 and ~ and other odd characters.
Is there a way to convert this to a "real path" that Safari accepts?
Solved:
set inx to item 1 of theDroppedItems -- first dropped item
set iny to POSIX path of inx -- plain slash-delimited path, no file:// or %20
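For comparison, the same file:// to plain path conversion can be sketched outside AppleScript. This is only an illustration in Python using the standard library; the function name and the example path are hypothetical:

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(url: str) -> str:
    """Turn a file:// URL into a plain POSIX path."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("not a file URL: " + url)
    # unquote() decodes percent-escapes such as %20 back into spaces
    return unquote(parsed.path)


print(file_url_to_path("file:///Users/me/My%20File.txt"))  # /Users/me/My File.txt
```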

Encoding problem if I put my code in a module or another .ps1 file

My code was working well with special chars. I could use Write-Host "é" without any issue.
Then I moved some of my functions to another .ps1 file that I "dot sourced" (using Import-Module does the same), and I got encoding errors: prénom became prÃ©nom.
I don't understand anything about encoding. VS Code doesn't seem to let me change the encoding of a file. It has a setting for the default encoding, but it defaults to UTF8, and setting it to Windows1252 changes nothing. If I use Geany to change the encoding to Windows1252 it works... until I save the file again with VS Code.
Everything was working well when all my code was in the same file. Why would creating this second .ps1 file (which I created from the Windows Explorer) be a problem?
Working on Windows 10, in French, with VS Code 1.50.
Thank you in advance
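The symptom is consistent with a UTF-8 file (without a byte order mark) being read as Windows-1252. A small Python sketch of that mix-up, assuming this is what happens to the second .ps1 file; the function names are hypothetical:

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Show what UTF-8 bytes look like when decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(garbled: str) -> str:
    """Reverse the mix-up: re-encode as Windows-1252, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


def utf8_with_bom(text: str) -> bytes:
    """Encode with a leading BOM so readers can detect UTF-8."""
    return text.encode("utf-8-sig")


garbled = misread_utf8_as_cp1252("prénom")
print(garbled)                    # prÃ©nom
print(repair_mojibake(garbled))   # prénom
print(utf8_with_bom("prénom")[:3])  # the three BOM bytes EF BB BF
```

Under that assumption, saving the dot-sourced file as UTF-8 with a BOM would be one way to make the encoding unambiguous, which matches the observation that the file only works until VS Code rewrites it.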

How do I properly unzip, on OS X, a zip from Windows that contains Chinese characters?

One day I zipped a file with a Chinese name, 周國賢 - 密封罩.flac, using Bandizip with the encoding set to UTF-8.
I then tried to unzip it on my MacBook Pro, which is (probably) using Macintosh as its encoding. The unzipped file is called ©P∞ÍΩ - ±K´ ∏n.flac, which does not match the Chinese name above.
So I tested the encoding and found that converting Macintosh->big5 turns the mysterious Macintosh symbols back into Chinese characters, but some of them do not match: 周衰�璀� - 密封罩.flac.
I tried another file, §˝µ· - ¨ı®ß.ape, and it actually produced the correct file name: 王菲 - 紅豆.ape
So here is my question: how do I properly unzip a file whose name contains Big5 Chinese characters, without any information loss? Or how do I zip a file correctly to prevent information loss or incorrect characters? (edit #2: you can use Bandizip to zip the file with UTF-8 encoding)
BTW, the encoding converter I am using is https://r12a.github.io/apps/encodings/, which can be quite helpful for checking encodings. Don't forget to click "change encodings shown". I am not the owner of the encoding converter.
edit #1: I have found that my setting in Bandizip was wrong, sorry for the inconvenience. Nonetheless, I figured out that The Unarchiver from the Mac App Store can unzip Big5 names correctly. This can be a workaround, but I still don't know how to unzip Big5 characters properly WITHOUT any loss.
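The Macintosh->big5 experiment above can be sketched in Python. This assumes the zip stored the names as Big5 bytes and that the Mac displayed those bytes as Mac Roman; the function names are hypothetical:

```python
def garble_big5_as_mac_roman(name: str) -> str:
    """Reproduce the symptom: Big5 bytes displayed as Mac Roman text."""
    return name.encode("big5").decode("mac_roman")


def recover_big5_name(garbled: str) -> str:
    """Undo it: turn the Mac Roman text back into bytes, decode as Big5."""
    return garbled.encode("mac_roman").decode("big5")


garbled = garble_big5_as_mac_roman("王菲 - 紅豆.ape")
print(garbled)
print(recover_big5_name(garbled))
```

The round trip is lossless only while every character of the garbled name survives unchanged. Mac Roman maps byte 0xCA to a non-breaking space, for example, and if that is turned into an ordinary space when the name is copied or displayed, the original Big5 byte can no longer be recovered. That may be why one file name converts cleanly and the other does not.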

How to automate search and replace for i18n.properties files with WebStorm

For SAPUI5 there are i18n.properties files.
For the German language I need to replace the special German characters with their Unicode escape codes.
# AE = \u00C4, ae = \u00E4
# OE = \u00D6, oe = \u00F6
# UE = \u00DC, ue = \u00FC
# SZ = \u00DF
How can I automate this search and replace with WebStorm?
You could just use WebStorm's 'Replace in Path' (CMD+SHIFT+R on Mac) on your i18n folder. IntelliJ IDEA has better editing support for .properties files, though (since they come from Java).
It would also be easy to do this via a Node script, a bash script, a gulp task, or the like.
Btw: Is this really needed? Having all .properties files in UTF-8 should just do the trick. AFAIK only Tomcat got confused by that, since in the Java spec these files are ISO-8859-1 by definition. As long as you are deploying to a platform that accepts them as UTF-8, there shouldn't be an issue.
BR
Chris
PS: That code looks really familiar ;D
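If a script is preferred over a manual search and replace, the escaping itself is short. A minimal Python sketch, assuming every non-ASCII character should become an uppercase \uXXXX escape as in the table from the question; the function name is hypothetical:

```python
def escape_properties(text: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX escape."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Go through UTF-16 so characters outside the BMP become
        # a surrogate pair, which is the form Java expects
        data = ch.encode("utf-16-be")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "big")
            out.append("\\u%04X" % unit)
    return "".join(out)


print(escape_properties("title=Größe ändern"))
# title=Gr\u00F6\u00DFe \u00E4ndern
```

Applying this to each line of an i18n.properties file leaves keys and ASCII text untouched and rewrites only the special characters.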

UTF-8 i18n file

I'm trying to add a Chinese localisation to a scaffolded Yesod site. I have a zh.msg message file saved as UTF-8 format using Notepad in Windows, but when I run cabal install in the project directory, I get this:
Handler\Home.hs:15:11:
Not in scope: data constructor `MsgHello'
Perhaps you meant `Msg<stderr>: hPutChar: invalid argument (invalid character)
The line in question is where I render my homepage:
$(widgetFile "homepage")
I changed both message files to be saved as "Unicode" (Notepad's name for UTF-16) instead of UTF-8, and get this message instead:
Foundation.hs:1:1:
Exception when trying to run compile-time code:
Cannot decode byte '\xff': Data.Text.Encoding.Fusion.streamUtf8: invalid UTF-8 stream
So I guess UTF-8 is the way to go... somehow.
(I'm using Notepad because I haven't set up gVim to render Unicode characters. It's apparently a bit of a feat.)
I discovered the issue when I went to commit my changes. The diff for my English file looked like this:
-Hello: Hello
+<U+FEFF>Hello: Hello
I guess Notepad added the character (a byte order mark) at the start of the file, and it was working its way into the Haskell code. I solved it using vim according to this answer.
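The <U+FEFF> in the diff is the UTF-8 byte order mark, stored as the three bytes EF BB BF. As an alternative to fixing it in an editor, a short Python sketch that strips it; the function names are hypothetical and the file path is up to the caller:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True


print(strip_utf8_bom(b"\xef\xbb\xbfHello: Hello"))  # b'Hello: Hello'
```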
