ColdFusion: convert accented regional characters to plain ASCII - utf-8

I need to convert characters from French, Swedish and other languages to their plain ASCII equivalents.
I'm not sure how to explain it, so here's an example:
ç -> c
ò -> o
...
In bash on Unix I would use iconv. How can I do this in ColdFusion 9 / Java?

I found this simple UDF at CFLib.org:
deAccent
<cfscript>
/**
 * Replaces accented characters with their closest non-accented equivalents.
 *
 * @return Returns a string.
 * @author Rachel Lehman (raelehman@gmail.com)
 * @version 1, November 15, 2010
 */
function deAccent(str){
    var newstr = "";
    var list1 = "á,é,í,ó,ú,ý,à,è,ì,ò,ù,â,ê,î,ô,û,ã,ñ,õ,ä,ë,ï,ö,ü,ÿ,À,È,Ì,Ò,Ù,Á,É,Í,Ó,Ú,Ý,Â,Ê,Î,Ô,Û,Ã,Ñ,Õ,Ä,Ë,Ï,Ö,Ü,Ÿ";
    var list2 = "a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u,a,n,o,a,e,i,o,u,y,A,E,I,O,U,A,E,I,O,U,Y,A,E,I,O,U,A,N,O,A,E,I,O,U,Y";
    newstr = ReplaceList(str,list1,list2);
    return newstr;
}
</cfscript>

You can also use the charsetEncode function built into CF, although it converts binary data to a string in the given encoding rather than removing accents:
encodedString = charsetEncode(stringToBeConverted, "utf-8");
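Since ColdFusion runs on the JVM, another option is Java's java.text.Normalizer. The sketch below is not taken from the answers above, and the names DeAccent and deAccent are made up for illustration. It decomposes each character (NFD) and then drops the combining marks, which covers ç -> c and ò -> o. Letters with no canonical decomposition, such as ø, æ or ß, pass through unchanged and would still need a manual replacement.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of ColdFusion or CFLib.
final class DeAccent {
    // Decompose to NFD, then remove every combining mark (Unicode category M).
    static String deAccent(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00e7" is c with cedilla, "\u00f2" is o with grave accent.
        System.out.println(deAccent("gar\u00e7on, per\u00f2")); // prints "garcon, pero"
    }
}
```

From CFML the same class should be reachable through createObject("java", "java.text.Normalizer"), though that route is untested here.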

Related

Get emoji flag from country code in Ruby

I want to convert a country code like "US" to an emoji flag, i.e. transform the string "US" into the appropriate Unicode characters in Ruby.
Here's an equivalent example for Java
Use tr to translate alphabetic characters to their regional indicator symbols:
'US'.tr('A-Z', "\u{1F1E6}-\u{1F1FF}")
#=> "🇺🇸"
Of course, you can also use the Unicode characters directly:
'US'.tr('A-Z', '🇦-🇿')
#=> "🇺🇸"
Here is a port of that to Ruby:
country = 'US'
flagOffset = 0x1F1E6
asciiOffset = 0x41
firstChar = country[0].ord - asciiOffset + flagOffset
secondChar = country[1].ord - asciiOffset + flagOffset
flag = [firstChar, secondChar].pack("U*")
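The same offset arithmetic can be sketched in Java. This is an illustration only, not the Java example the question refers to; the class and method names are hypothetical. Each letter A-Z is shifted to the matching regional indicator symbol, starting at U+1F1E6.

```java
import java.util.Locale;

// Hypothetical helper: maps a two-letter country code to its flag emoji.
final class EmojiFlag {
    private static final int FLAG_OFFSET = 0x1F1E6; // REGIONAL INDICATOR SYMBOL LETTER A
    private static final int ASCII_OFFSET = 0x41;   // 'A'

    static String flag(String countryCode) {
        String code = countryCode.toUpperCase(Locale.ROOT);
        if (!code.matches("[A-Z]{2}")) {
            throw new IllegalArgumentException("expected two letters A-Z: " + countryCode);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < code.length(); i++) {
            // Regional indicators lie outside the BMP, so append by code point.
            sb.appendCodePoint(code.charAt(i) - ASCII_OFFSET + FLAG_OFFSET);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(flag("US"));
    }
}
```

Whether the result renders as a flag or as two boxed letters depends on the font and platform.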

VB function that strips accents from a string

I'm trying to write a function that takes a string and returns the same string with each accented letter replaced by its unaccented equivalent. This function is not working:
function StripAccents(str)
    accent = "ÈÉÊËÛÙÏÎÀÂÔÖÇèéêëûùïîàâôöç"
    noaccent = "EEEEUUIIAAOOCeeeeuuiiaaooc"
    currentChar = ""
    result = ""
    k = 0
    o = 0
    FOR k = 1 TO len(str)
        currentChar = mid(str,k, 1)
        o = InStr(accent, currentChar)
        IF o > 0 THEN
            result = result & mid(noaccent,k,1)
        ELSE
            result = result & currentChar
        END IF
    NEXT
    StripAccents = result
End function
testStr = "Test : à é À É ç"
response.write(StripAccents(testStr))
This is the result using the above:
Test : E E Eu EE E
Disregarding possible encoding problems - you must change
result = result & mid(noaccent,k,1)
to
result = result & mid(noaccent,o,1)
I tried the example code with the correction applied, then added more characters, giving:
accent = "àèìòùÀÈÌÒÙäëïöüÄËÏÖÜâêîôûÂÊÎÔÛáéíóúÁÉÍÓÚðÐýÝãñõÃÑÕšŠžŽçÇåÅøØ"
noaccent = "aeiouAEIOUaeiouAEIOUaeiouAEIOUaeiouAEIOUdDyYanoANOsSzZcCaAoO"
I then realised that there are a few more to deal with, namely
æ
Æ
ß
These need converting first, with a simple replace to ae, AE and ss.
After that it works fine, provided the page does not contain <%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%> or similar.
Having meta charset="UTF-8" in the header is not a big issue, though; it still converts fine.
If the code is needed on a page that does contain <%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>, I don't know of a solution.
Thanks for the code greener, very useful for dealing with the common diacriticals :-)
You should probably do a decomposition normalization first (NFD). I think you could do this in VBA using a call to the WinAPI function NormalizeString (https://msdn.microsoft.com/en-us/library/windows/desktop/dd319093(v=vs.85).aspx). Then, you could remove the accent code points.
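The corrected table lookup from the answers above, together with the æ/Æ/ß pre-replacement, can be sketched in Java as follows. The class and method names are hypothetical, and the two tables are passed in so that the accent and noaccent strings from the answer can be supplied as they are. The example uses Unicode escapes and a shortened table only to keep the source ASCII-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical Java rendering of the corrected lookup; names are illustrative.
final class StripAccents {
    // Multi-letter replacements that a one-to-one table cannot express.
    private static final Map<String, String> EXPANSIONS = new LinkedHashMap<>();
    static {
        EXPANSIONS.put("\u00e6", "ae"); // ae ligature, lowercase
        EXPANSIONS.put("\u00c6", "AE"); // AE ligature, uppercase
        EXPANSIONS.put("\u00df", "ss"); // sharp s
    }

    static String stripAccents(String str, String accent, String noaccent) {
        if (accent.length() != noaccent.length()) {
            throw new IllegalArgumentException("tables must have the same length");
        }
        for (Map.Entry<String, String> e : EXPANSIONS.entrySet()) {
            str = str.replace(e.getKey(), e.getValue());
        }
        StringBuilder result = new StringBuilder(str.length());
        for (int k = 0; k < str.length(); k++) {
            char currentChar = str.charAt(k);
            int o = accent.indexOf(currentChar);
            // Index the table with o (position in accent), not k (position in str).
            result.append(o >= 0 ? noaccent.charAt(o) : currentChar);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // a-grave, e-acute, A-grave, E-acute, c-cedilla
        String accent = "\u00e0\u00e9\u00c0\u00c9\u00e7";
        String noaccent = "aeAEc";
        String input = "Test : \u00e0 \u00e9 \u00c0 \u00c9 \u00e7 \u00e6";
        System.out.println(stripAccents(input, accent, noaccent)); // prints "Test : a e A E c ae"
    }
}
```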

How to convert utf-8 encoded string to Turkish characters in Xcode?

I have a web service in PHP, and I encode the string as UTF-8 like this:
$str_output = mb_convert_encoding("MATEMATİK", "UTF-8");
$data_array = array('name' => $str_output);
echo json_encode($data_array);
I get this string from the web service in Xcode: MATEMAT\u00ddK
I couldn't convert this string to the correct Turkish string.
My json_dictionary is like this
2014-01-08 16:17:22.274 test_app[6432:70b] {
name = "MATEMAT\U00ddK";
}
I tried this conversion, but it didn't work for me:
NSString * name = [json_dictionary objectForKey:@"name"];
NSString * correctString = [NSString stringWithCString:[baslik cStringUsingEncoding:NSUTF8StringEncoding] encoding:NSWindowsCP1254StringEncoding];
That gives me null. If I use NSUTF8StringEncoding instead, I get:
MATEMATÝK
I also tried NSISOLatin1StringEncoding, NSISOLatin2StringEncoding ...
Thanks...
iOS is correctly decoding the \u00dd when you use NSUTF8StringEncoding (which is what you should be using). That's LATIN CAPITAL LETTER Y WITH ACUTE. The letter you want is LATIN CAPITAL LETTER I WITH DOT ABOVE, which is \u0130.
That suggests the problem is on your PHP side. If I had to guess, I'd suspect that the İ in your source file is not itself in the encoding that PHP expects. You may need to pass the "from" encoding to mb_convert_encoding, depending on what encoding your editor is using.
I would strongly recommend that you stay in UTF-8 entirely if possible, and avoid creating a CP1254 (Turkish) string at all. UTF-8 is capable of encoding all the characters you need. In that case, you may be able to avoid the mb_convert_encoding entirely.
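To see how a single byte can turn İ into Ý, here is a small Java sketch. It rests on an assumption that the question does not confirm: that the PHP source file was saved as ISO-8859-9 (or Windows-1254), where İ is the byte 0xDD, and that the byte was then interpreted as ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo: one byte, two interpretations.
final class TurkishByteDemo {
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed byte for the dotted capital I in an ISO-8859-9 source file.
        byte[] raw = { (byte) 0xDD };
        String asTurkish = decode(raw, "ISO-8859-9");
        String asLatin1 = decode(raw, "ISO-8859-1");
        System.out.printf("ISO-8859-9: U+%04X%n", (int) asTurkish.charAt(0)); // prints "ISO-8859-9: U+0130"
        System.out.printf("ISO-8859-1: U+%04X%n", (int) asLatin1.charAt(0));  // prints "ISO-8859-1: U+00DD"
    }
}
```

If that assumption holds, the \u00dd seen in the JSON is the Latin-1 reading of the same byte, which is consistent with the advice to fix the encoding on the PHP side.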

java.lang.NumberFormatException or java.nio.BufferUnderflowException when transforming bytes

I played around with some String -> byte -> binary code, and I want it to work for any byte[] array. Currently it only works for, I think, ASCII?
Chinese doesn't work:
String message =" 汉语";
playingWithFire(message.getBytes());
whereas String wow = "WOW..."; works :( I want it to work for all UTF-8 input. Any pointers on how I can do that?
//thanks
public static byte[] playingWithFire(byte[] bytes){
    byte[] newbytes = null;
    newbytes = new byte[bytes.length];
    for(int i = 0; i < bytes.length; i++){
        String tempStringByte = String.format("%8s", Integer.toBinaryString(bytes[i] & 0xFF)).replace(' ', '0');
        StringBuffer newByteBrf = null;
        newByteBrf = new StringBuffer();
        for(int x = 0; x < tempStringByte.length(); x++){
            newByteBrf.append(tempStringByte.charAt(x));
        }
        /*short a = Short.parseShort(newByteBrf.toString(), 2);
        ByteBuffer bytesads = ByteBuffer.allocate(2).putShort(a);
        newbytes[i] = bytesads.get();
        cause: java.nio.BufferUnderflowException
        */
        //cause: java.lang.NumberFormatException: Value out of range.
        newbytes[i] = Byte.parseByte(newByteBrf.toString(), 2);
    }
    return newbytes;
}
message.getBytes() in your case converts the Chinese characters to bytes using the default character set on your computer. If it's a Western charset, the result is going to be wrong.
Notice that String.getBytes() has another form, String.getBytes(String), where the argument is the name of the character encoding used to convert the chars of the string to bytes.
The char type will hold Unicode. The byte type only holds raw bits in groups of 8.
So, to convert a Unicode string to bytes encoded as UTF-16 you would use this code:
String message =" 汉语";
byte[] utf16Bytes = message.getBytes("utf-16");
Substitute the name of any encoding that you want to use.
Similarly, the new String(byte[], String) constructor takes an array of bytes encoded in some fashion and, given the name of that encoding, converts those bytes to Unicode characters.
For example: If you want to convert those bytes, which were encoded as utf-16 above, back to a String (which has Unicode chars in it):
String newMessage = new String(utf16Bytes, "utf-16");
Since I don't know what you mean by "binary code" above, I can't go much farther. As I see it, the Unicode chars have a binary code inside them that represents the characters one-by-one. Also the byte array has a binary code in it that represents the characters with a many-bytes-to-one-character representation. If you want to encrypt the byte array somehow, use a standard, proven encryption method and proven, time-tested procedures to secure the contents.
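For the two exceptions in the question itself, a hedged sketch: Byte.parseByte only accepts values from -128 to 127, so a bit string such as "11100110" (230), which any UTF-8 encoded Chinese character produces, is out of range. In the commented-out variant, the relative get() is called right after putShort has moved the buffer position to its limit, so nothing remains to read. Parsing as an int and narrowing to byte avoids both. The class and method names below are hypothetical, and the charset is fixed to UTF-8 rather than the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical rewrite of the byte -> bit string -> byte round trip.
final class BinaryRoundTrip {
    // Render each byte as 8 binary digits, then parse the digits back into a byte.
    static byte[] roundTrip(byte[] bytes) {
        byte[] result = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            String bits = String.format("%8s", Integer.toBinaryString(bytes[i] & 0xFF)).replace(' ', '0');
            // Byte.parseByte rejects values above 127, so parse as int and narrow.
            result[i] = (byte) Integer.parseInt(bits, 2);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same text as in the question, written with Unicode escapes.
        String message = " \u6c49\u8bed";
        byte[] utf8 = message.getBytes(StandardCharsets.UTF_8);
        byte[] back = roundTrip(utf8);
        System.out.println(Arrays.equals(utf8, back));                                // prints "true"
        System.out.println(new String(back, StandardCharsets.UTF_8).equals(message)); // prints "true"
    }
}
```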

Laravel 4 cyrillic slug

Currently in L4 you can't get a slug from a Cyrillic string. In L3 there was an ascii array for that. Where and how can I add this array, or the ability to create a slug from a Cyrillic string?
EDIT
The library https://github.com/cocur/slugify is a good option, but I decided to build a custom Slug library for L4 from the L3 methods and ascii array. Now I have a working slug maker in L4, just like in L3.
You can install this library (https://github.com/cocur/slugify) via Composer and use it.
It's super easy to install and use.
I faced this problem when working with Arabic, so I wrote the following function, which solved it for me.
function make_slug($string = null, $separator = "-") {
    if (is_null($string)) {
        return "";
    }
    // Remove spaces from the beginning and from the end of the string
    $string = trim($string);
    // Lower case everything
    // using mb_strtolower() is important for non-Latin UTF-8 strings | more info: http://goo.gl/QL2tzK
    $string = mb_strtolower($string, "UTF-8");
    // Make alphanumeric (removes all other characters)
    // this makes the string safe, especially when used as part of a URL
    // this keeps Latin characters and Arabic characters as well
    $string = preg_replace("/[^a-z0-9_\s-ءاأإآؤئبتثجحخدذرزسشصضطظعغفقكلمنهويةى]/u", "", $string);
    // Collapse runs of dashes or whitespace into a single space
    $string = preg_replace("/[\s-]+/", " ", $string);
    // Convert whitespace and underscores to the given separator
    $string = preg_replace("/[\s_]/", $separator, $string);
    return $string;
}
This function solves the problem only for Arabic. If you want to solve it for Cyrillic or any other language, you need to add the Cyrillic characters (or the other language's characters) beside or instead of the existing Arabic characters ءاأإآؤئبتثجحخدذرزسشصضطظعغفقكلمنهويةى.
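Instead of listing every alphabet by hand, the character class can be written with Unicode properties. The Java sketch below mirrors the three steps of make_slug but keeps letters and digits of any script via \p{L} and \p{Nd}. It is an illustration, not a port of the L3 slug code: the names are hypothetical, and unlike the L3 ascii array it does not transliterate to Latin, so a Cyrillic title yields a Cyrillic slug. Combining marks such as Arabic vowel signs fall outside \p{L} and are dropped.

```java
import java.util.Locale;
import java.util.regex.Matcher;

// Hypothetical script-agnostic variant of make_slug.
final class Slug {
    static String makeSlug(String string, String separator) {
        if (string == null) {
            return "";
        }
        // Trim and lower-case, independent of the default locale.
        String s = string.trim().toLowerCase(Locale.ROOT);
        // Keep letters and digits of any script, plus underscore, whitespace and dash.
        s = s.replaceAll("[^\\p{L}\\p{Nd}_\\s\\-]", "");
        // Collapse runs of whitespace and dashes into a single space.
        s = s.replaceAll("[\\s\\-]+", " ");
        // Turn the remaining whitespace and underscores into the separator.
        return s.replaceAll("[\\s_]", Matcher.quoteReplacement(separator));
    }

    public static void main(String[] args) {
        // Cyrillic "Privet,  Mir!" written with Unicode escapes.
        String title = "  \u041f\u0440\u0438\u0432\u0435\u0442,  \u041c\u0438\u0440! ";
        System.out.println(makeSlug(title, "-"));
    }
}
```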

Resources