How to convert EBCDIC with Chinese chars to UTF-8 format

I have a requirement to convert an EBCDIC-encoded file, which uses the IBM937 code page, to UTF-8 so that it can be loaded into a multi-byte-enabled DB2 database.
I have tried Unix recode and iconv; neither has the ability to convert IBM937 to UTF-8. I'm looking for any utility (Java, Perl, Unix) that can do this on a Unix-based system. Can someone help me here?
SL

Take a look at ICU (International Components for Unicode): http://site.icu-project.org/
It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/convexp?conv=ibm-937_P110-1999&s=ALL
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.
Here are a few highlights of the services provided by ICU:
Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.
Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.
Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.
Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs is provided.
Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.
Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.
Bidi: Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.
Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
And much more. Refer to the ICU User Guide for details.
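Since the question asks for a Java utility on Unix, here is a minimal, hypothetical sketch of the conversion using ICU4J's charset provider (the "ibm-937" alias is assumed from the converter page linked above; file names are placeholders):

import com.ibm.icu.charset.CharsetProviderICU;
import java.io.*;
import java.nio.charset.Charset;

public class Ibm937ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Look up ICU's IBM-937 converter via the ICU4J charset provider.
        Charset ibm937 = new CharsetProviderICU().charsetForName("ibm-937");
        try (Reader in = new BufferedReader(
                 new InputStreamReader(new FileInputStream("input.ebcdic"), ibm937));
             Writer out = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream("output.txt"), "UTF-8"))) {
            int ch;
            while ((ch = in.read()) != -1) {
                out.write(ch); // decoded from IBM-937 to Unicode, re-encoded as UTF-8
            }
        }
    }
}

On the Unix side, the ICU distribution also ships a small command-line converter, uconv, which should be able to do the same thing (roughly: uconv -f ibm-937 -t utf-8 infile > outfile).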

It appears that Java can convert the IBM937 code page to UTF-8.
You would specify the input encoding as "cp937".
Here are two methods from the Oracle page on Character and Byte Streams:
// requires: import java.io.*;
static String readInput() {
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis, "cp937");
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        in.close();
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
and
static void writeOutput(String str) {
    try {
        FileOutputStream fos = new FileOutputStream("test.txt");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(str);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
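Tying the two together for this conversion is then just a matter of reading with "cp937" and writing with "UTF8". A minimal, hypothetical main (note that both methods above hard-code "test.txt", so point them at separate input and output files first):

public static void main(String[] args) {
    String text = readInput();   // decodes the EBCDIC (cp937) file into a Java String
    if (text != null) {
        writeOutput(text);       // re-encodes the same text as UTF-8
    }
}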

Related

UTF8mb4 unicode breaking MariaDB JDBC driver

I have some product names that include Unicode characters, e.g.:
⚠️📷PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER
A query in HeidiSQL shows it fine.
I set up MariaDB fresh this morning, having moved from MySQL, but when records are retrieved through a ColdFusion query using the MariaDB JDBC driver I get
java.lang.StringIndexOutOfBoundsException: begin 0, end 80, length 74
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3410)
at java.base/java.lang.String.substring(String.java:1883)
at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalString(TextRowProtocol.java:238)
at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getString(SelectResultSet.java:948)
The productname field collation is utf8mb4_unicode_520_ci; I've tried a few options, and I've tried to set this at table and database level where it let me.
The JDBC connection string in the ColdFusion admin is jdbc:mysql://localhost:3307/usedlens?useUnicode=true&characterEncoding=UTF-8
I note that on the live production database, where MariaDB was used from the beginning, I don't have this trouble, but there the default charset is latin1, and the same record appears in the database as
????PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER
Here's how we've been stripping high ASCII characters while retaining any characters that may be salvaged:
string function ASCIINormalize(string inputString=""){
    return createObject( 'java', 'java.text.Normalizer' )
        .normalize( javacast("string", arguments.inputString),
                    createObject( 'java', 'java.text.Normalizer$Form' ).valueOf('NFD') )
        .replaceAll('\p{InCombiningDiacriticalMarks}+','')
        .replaceAll('[^\p{ASCII}]+','');
}
productname = ASCIINormalize(productname);
/*
Comparisons using java UDF versus reReplace regex:
"ABC Café ’test" (note: High ASCII non-normal whitespace characters used.)
ASCIINormalize = "ABC Cafe test"
reReplace = "ABC Caf test"
"čeština"
ASCIINormalize = "cestina"
reReplace = "etina"
"Häuser Bäume Höfe Gärten"
ASCIINormalize = "Hauser Baume Hofe Garten"
reReplace = "Huser Bume Hfe Grten"
*/
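If you want the same normalization outside ColdFusion, here is a plain-Java sketch of the java.text.Normalizer call that the UDF above delegates to:

import java.text.Normalizer;

static String asciiNormalize(String input) {
    // Decompose accented characters (NFD), strip the combining marks,
    // then strip anything that still isn't 7-bit ASCII.
    return Normalizer.normalize(input, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
            .replaceAll("[^\\p{ASCII}]+", "");
}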
This is due to a sequence of high ASCII characters that form emojis. I encountered similar issues when exporting MSSQL data to a UTF-8 file to be converted to Excel using a 3rd party tool. In this case, the database and file were correct, but the 3rd party tool would crash when encountering emoji characters.
Our approach to this was to convert emojis to their aliases so that information wasn't lost in the process. (If you strip high ASCII characters, you may lose some context.) To sanitize emojis to use aliases, I wrote this ColdFusion cf-emoji-java (CFC) to leverage emoji-java (JAR file) to convert emojis to their ASCII7-safe aliases.
emojijava = new emojijava();
emojijava.parseToAliases('I like 🍕'); // I like :pizza:
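For reference, the underlying JAR call in plain Java looks like this (a minimal sketch, assuming the emoji-java library is on the classpath):

import com.vdurmont.emoji.EmojiParser;

// Replace each emoji with its ASCII-safe alias, e.g. 🍕 becomes :pizza:
String safe = EmojiParser.parseToAliases("I like 🍕"); // "I like :pizza:"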
Since...
I'm not really in the business of supporting emojis,
my data is just product names targeted at the UK, Europe and the United States for the foreseeable future, and
I don't want to have to go through the same trouble with production (already defaulted to latin1_swedish_ci),
I decided to:
match production, so I set the database, table, and fields to latin1_swedish_ci with help from
How to change the CHARACTER SET (and COLLATION) throughout a database?
and strip non-ASCII characters in the product name.
== edit: don't do this, it takes out too many useful characters ==
<cfset productname = reReplace(productname, "[^\x20-\x7E]", "", "ALL")>

New iText Producer field causes validation failure

I switched from the old iText library to the iTextPdf library and noticed a problem. The new library sets the producer to a value that includes non-Unicode characters (windows TM symbol and copyright symbol). The problem is that validation programs that read this text choke on these characters.
Can I get iText to fix this (w/o paying for a license)? I am ok with iText getting credit. I just want the credits to be Unicode clean.
<</Producer(iText® 5.5.0 ©2000-2013 iText Group NV \(AGPL-version\))/ModDate(D:20150126155550-07'00')/CreationDate(D:20150126155550-07'00')>>
You are looking at the document information dictionary of a PDF, more exactly at the value of its Producer entry. It is specified as:
Producer  (text string; Optional)  If the document was converted to PDF from another format, the name of the conforming product that converted it to PDF.
(Table 317 – Entries in the document information dictionary)
So the value must have the type text string. This in turn is specified as:
The text string type shall be used for character strings that shall be encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Annex D.
(section 7.9.2.2 Text String Type)
In Annex D you find:
CHAR  CHAR NAME   CODE (OCTAL)
                  STD  MAC  WIN  PDF
...
©     copyright   —    251  251  251
...
®     registered  —    250  256  256
...
(D.2 Latin Character Set and Encodings)
Thus, these characters are completely valid here and validators which choke on these characters are broken.
So you had better report this bug to the developers of the validators in question.

Does V8 have Unicode support?

I'm using V8 to run JavaScript from native (C++) code. To call a JavaScript function I need to convert all the parameters to V8 data types.
For example, the code to convert a char* to a V8 data type:
char* value;
...
v8::String::New(value);
Now I need to pass Unicode characters (wchar_t) to JavaScript.
First of all, does V8 support Unicode characters? If yes, how do I convert a wchar_t/std::wstring to a V8 data type?
I'm not sure if this was the case at the time this question was asked, but at the moment the V8 API has a number of functions which support UTF-8, UTF-16 and Latin-1 encoded text:
https://github.com/v8/v8/blob/master/include/v8.h
The relevant functions to create new string objects are:
String::NewFromUtf8 (UTF-8 encoded, obviously)
String::NewFromOneByte (Latin-1 encoded)
String::NewFromTwoByte (UTF-16 encoded)
Alternatively, you can avoid copying the string data and construct a V8 string object that refers to existing data (whose lifecycle you control):
String::NewExternalOneByte (Latin-1 encoded)
String::NewExternalTwoByte (UTF-16 encoded)
Unicode just maps characters to numbers. What you need is a proper encoding, like UTF-8 or UTF-16.
V8 seems to support UTF-8 (v8::String::WriteUtf8) and an otherwise undescribed 16-bit type (Write). I would give it a try and write some UTF-16 into it.
In Unicode applications, Windows stores UTF-16 in std::wstring. Maybe try something like
std::wstring yourString;
// Note: wchar_t is 16-bit UTF-16 on Windows, so this cast is only valid there.
v8::String::New(reinterpret_cast<const uint16_t*>(yourString.c_str()));
No, it doesn't have wchar_t support directly; the above solution is fine.
The following code did the trick:
wchar_t path[1024] = L"gokulestás";
// Caveat: casting wchar_t* to uint16_t* assumes a 16-bit wchar_t (Windows);
// on most Unix systems wchar_t is 32-bit.
v8::String::New((uint16_t*)path, wcslen(path));

mongodb/gridfs java-driver use with utf-8 meta data

I am trying to use GridFS to load a file along with some metadata using the Java driver (2.5.3).
Things work fine as long as the metadata is in ASCII, but I get an exception the moment I try to set a UTF-8 string with non-ASCII characters.
String metaData = "学海";
GridFS gridFS = new GridFS(db);
GridFSInputFile inputFile = gridFS.createFile(new File(filePath));
DBObject dbObj = inputFile.getMetaData();
dbObj.put("metaData", metaData); // ----> get the exception here (if non-ASCII data)
inputFile.save();
Are you able to use UTF8 strings when storing regular documents?
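A quick way to check that, sketched against the same 2.x driver API (collection name is a placeholder):

// import com.mongodb.BasicDBObject; import com.mongodb.DBCollection;
DBCollection coll = db.getCollection("utf8test");
coll.insert(new BasicDBObject("metaData", "学海")); // does a plain document accept the UTF-8 string?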
Based on your description, it sounds like you're trying to report a bug rather than ask a question.
MongoDB uses a JIRA system for reporting bugs. If you can include the code you are using, it will help the driver developers correct the issue.

What kind of strings does CFStringCreateWithFormat expect as arguments?

The below example should work with Unicode strings but it doesn't.
CFStringRef aString = CFSTR("one"); // in real life this is an Unicode string
CFStringRef formatString = CFSTR("This is %s example"); // also tried %S but without success
CFStringRef resultString = CFStringCreateWithFormat(NULL, NULL, formatString, aString);
// Here I should have a valid sentence in resultString, but the current result looks as if aString contained garbage.
Use %@ if you want to include a CFStringRef via CFStringCreateWithFormat.
See the Format Specifiers section of the Strings Programming Guide for Core Foundation.
%@ is for Objective-C objects or CFTypeRef objects (CFStringRef is compatible with CFTypeRef).
%s is for a null-terminated array of 8-bit unsigned characters (i.e. normal C strings).
%S is for a null-terminated array of 16-bit Unicode characters.
A CFStringRef object is not the same as "a null-terminated array of 16-bit Unicode characters".
As an answer to the comment in the other answer, I would recommend that the poster
generate a UTF-8 string in a portable way into a char*, and,
at the last minute, convert it to a CFString using CFStringCreateWithCString with kCFStringEncodingUTF8 as the encoding.
Please, please do not use %s in CFStringCreateWithFormat. Please do not rely on the "system encoding", which is MacRoman in Western European environments but not in other languages. The concept of the system encoding is inherently brain-dead, especially in East Asian environments (which I came from), where even characters inside the ASCII code range (below 127!) are modified. Hell breaks loose if you rely on the "system encoding". Fortunately, since 10.4, all of the methods which use the "system encoding" are now deprecated, except %s...
I'm sorry to write this much on such a small topic, but it was a real pity a few years ago when there were many nice apps which didn't work on Japanese/Korean Macs because of just this "system encoding". Please refer to this detailed explanation which I wrote a few years ago, if you're interested.
