PowerBuilder: ImportFile of UTF-8 (Converting UTF-8 to ANSI)

My PowerBuilder version is 6.5; I cannot use a higher version, as this is what I am supporting.
My problem is that when I do dw_1.ImportFile(file), the first row's first column contains a funny string like this:

I did not understand this until I tried opening the file, saving it to a new text file, and importing that new file instead, which worked flawlessly, without the funny string.
My conclusion is that this happens because the original file is UTF-8 (as shown in Notepad++) while the new file is ANSI. The file I am trying to import is delivered automatically by a 3rd party, and my users don't want the extra job of re-saving it by hand.
How do I force-convert these files to ANSI in PowerBuilder? If there is no way, I might have to do a command-prompt conversion. Any ideas?

The weird characters are the (optional) UTF-8 BOM that tells editors the file is UTF-8 encoded (which can otherwise be hard to detect unless a character above code 127 happens to occur). You cannot just strip it off, because if your file contains any character above 127 (accents or any special char), you will still have garbage in your displayed data (for example: é -> Ã©, € -> â‚¬, ...), where each special character becomes 2 to 4 garbage chars.
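For reference, the BOM is the three-byte sequence 0xEF 0xBB 0xBF at the very start of the file. As a minimal C sketch (the function name is illustrative, this is not PB code), detecting and skipping it looks like this:

#include <stddef.h>

/* Return a pointer past a leading UTF-8 BOM (0xEF 0xBB 0xBF), if present. */
const char *skip_utf8_bom(const char *buf, size_t len)
{
    if (len >= 3 &&
        (unsigned char)buf[0] == 0xEF &&
        (unsigned char)buf[1] == 0xBB &&
        (unsigned char)buf[2] == 0xBF)
        return buf + 3; /* data starts after the BOM */
    return buf; /* no BOM present */
}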
I recently needed to convert some UTF-8 encoded strings to the "ANSI" Windows-1252 encoding. With PB 10 and later, re-encoding between UTF-8 and ANSI is as simple as
b = blob(s, encodingutf8!)
s2 = string(b, encodingansi!)
But string() and blob() did not support an encoding argument before release 10 of PB.
What you can do is to read the file yourself, skip the BOM, ask Windows to convert the string encoding via MultiByteToWideChar() + WideCharToMultiByte() and load the converted string in the DW with ImportString().
Proof of concept to get the file contents (with this reading method, the file cannot be bigger than 2GB):
string ls_path, ls_file, ls_chunk, ls_ansi
int li_file
ls_path = sle_path.text
if not fileexists(ls_path) then return
li_file = FileOpen(ls_path, streammode!)
if li_file > 0 then
    FileSeek(li_file, 3, FromBeginning!) //skip the utf-8 BOM
    //read the file by blocks, FileRead is limited to 32kB per call
    do while FileRead(li_file, ls_chunk) > 0
        ls_file += ls_chunk //concatenating in a loop works but is not very performant
    loop
    FileClose(li_file)
    ls_ansi = utf8_to_ansi(ls_file)
    dw_tab.importstring(text!, ls_ansi)
end if
utf8_to_ansi() is a global function; it was written for PB 9, but it should work the same with PB 6.5:
global type utf8_to_ansi from function_object
end type
type prototypes
function ulong MultiByteToWideChar(ulong CodePage, ulong dwflags, ref string lpmultibytestr, ulong cchmultibyte, ref blob lpwidecharstr, ulong cchwidechar) library "kernel32.dll"
function ulong WideCharToMultiByte(ulong CodePage, ulong dwFlags, ref blob lpWideCharStr, ulong cchWideChar, ref string lpMultiByteStr, ulong cbMultiByte, ref string lpDefaultChar, ref boolean lpUsedDefaultChar) library "kernel32.dll"
end prototypes
forward prototypes
global function string utf8_to_ansi (string as_utf8)
end prototypes
global function string utf8_to_ansi (string as_utf8);
//convert utf-8 -> ansi
//use a wide-char native string as pivot
constant ulong CP_ACP = 0
constant ulong CP_UTF8 = 65001
string ls_wide, ls_ansi, ls_null
blob lbl_wide
ulong ul_len
boolean lb_flag
setnull(ls_null)
lb_flag = false
//get utf-8 string length converted as wide-char
setnull(lbl_wide)
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, 0)
//allocate buffer to let windows write into
ls_wide = space(ul_len * 2)
lbl_wide = blob(ls_wide)
//convert utf-8 -> wide char
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, ul_len)
//get the final ansi string length
setnull(ls_ansi)
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, 0, ls_null, lb_flag)
//allocate buffer to let windows write into
ls_ansi = space(ul_len)
//convert wide-char -> ansi
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, ul_len, ls_null, lb_flag)
return ls_ansi
end function
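For comparison, here is the same pivot-through-wide-chars pattern written directly against the Win32 API in C. This is a sketch with minimal error handling, not a drop-in replacement for the PB function:

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to the system ANSI codepage.
   The caller must free() the result. */
char *utf8_to_ansi_c(const char *utf8)
{
    /* measure, allocate, convert: UTF-8 -> UTF-16 */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    wchar_t *wide = (wchar_t *)malloc(wlen * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, wlen);

    /* same pattern again: UTF-16 -> ANSI (CP_ACP) */
    int alen = WideCharToMultiByte(CP_ACP, 0, wide, -1, NULL, 0, NULL, NULL);
    char *ansi = (char *)malloc(alen);
    WideCharToMultiByte(CP_ACP, 0, wide, -1, ansi, alen, NULL, NULL);

    free(wide);
    return ansi;
}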

Related

C Program Strange Characters retrieved due to language setting on Windows

If the code below is compiled with UNICODE as a compiler option, the GetComputerNameEx API returns junk characters.
Whereas if it is compiled without the UNICODE option, the API returns a truncated value of the hostname.
This issue is mostly seen with Asia-Pacific languages like Chinese, Japanese, and Korean, to name a few (i.e., non-English).
Can anyone throw some light on how this issue can be resolved?
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define INFO_SIZE 30

int main()
{
    int ret;
    TCHAR infoBuf[INFO_SIZE+1];
    DWORD bufSize = (INFO_SIZE+1);
    char *buf;
    buf = (char *) malloc(INFO_SIZE+1);
    if (!GetComputerNameEx((COMPUTER_NAME_FORMAT)1,
            (LPTSTR)infoBuf, &bufSize))
    {
        printf("GetComputerNameEx failed (%d)\n", GetLastError());
        return -1;
    }
    ret = wcstombs(buf, infoBuf, (INFO_SIZE+1));
    buf[INFO_SIZE] = '\0';
    return 0;
}
In the languages you mentioned, most characters are represented by more than one byte. This is because these languages have alphabets of much more than 256 characters. So you may need more than 30 bytes to encode 30 characters.
The usual pattern for calling a function like wcstombs goes like this: first get the amount of bytes required, then allocate a buffer, then convert the string.
(edit: that actually relies on a POSIX extension, which also got implemented on Windows)
size_t size = wcstombs(NULL, infoBuf, 0);
if (size == (size_t) -1) {
    // some character can't be converted
}
char *buf = new char[size + 1];
size = wcstombs(buf, infoBuf, size + 1);
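Applied to this question, a sketch of the whole flow using the wide API (GetComputerNameExW) and the same query-then-allocate pattern for the narrow conversion might look like this (error handling trimmed; WideCharToMultiByte is used here instead of wcstombs, an equivalent choice, not the only one):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* first call fails with ERROR_MORE_DATA and reports the required size */
    DWORD wlen = 0;
    GetComputerNameExW(ComputerNameDnsHostname, NULL, &wlen);
    wchar_t *wname = (wchar_t *)malloc(wlen * sizeof(wchar_t));
    if (!GetComputerNameExW(ComputerNameDnsHostname, wname, &wlen)) {
        printf("GetComputerNameExW failed (%lu)\n", GetLastError());
        return -1;
    }

    /* query-then-allocate again for the wide -> narrow conversion */
    int blen = WideCharToMultiByte(CP_ACP, 0, wname, -1, NULL, 0, NULL, NULL);
    char *name = (char *)malloc(blen);
    WideCharToMultiByte(CP_ACP, 0, wname, -1, name, blen, NULL, NULL);

    printf("%s\n", name);
    free(wname);
    free(name);
    return 0;
}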

stored hex values in notepad file with .ini extension how to read it in hex only via CAPL

I have stored hex values in a text file with a .ini extension, along with an address. But when I read the file, the values are not in hex format, they come back as characters. Is there any way to read a value as hex and store it in a byte, in C or in a CAPL script?
I assume that you know how to read a text file in CAPL...
You can convert a hex string to a number using strtol(char s[], long result&):long. See the CAPL help (CAPL Function Overview -> General -> strtol):
The number base is
hexadecimal if the string starts with "0x"
octal if the string starts with "0"
decimal otherwise
Whitespace (spaces or tabs) at the start of the string is ignored.
Example:
on start
{
    long number1, number2;
    strtol("0xFF", number1);
    strtol("-128", number2);
    write("number1 = %d", number1);
    write("number2 = %d", number2);
}
Output:
number1 = 255
number2 = -128
See also: strtoll(), strtoul(), strtoull(), strtod() and atol()
Update:
If the hex string does not start with "0x"...
on message 0x200
{
    if (this.byte(0) == hextol("38"))
        write("byte(0) == 56");
}

long hextol(char s[])
{
    long res;
    char xs[8];
    strncpy(xs, "0x", elcount(xs)); // copy "0x" into 'xs'
    strncat(xs, s, elcount(xs));    // append 's' to 'xs'
    strtol(xs, res);                // convert to long
    return res;
}
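On the C side of the question, the standard library strtol does the same job; base 16 parses plain hex digits, and base 0 reproduces the prefix rules described above. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* base 16: no 0x prefix required */
    unsigned char b = (unsigned char)strtol("38", NULL, 16);
    printf("%u\n", b);  /* prints 56 */

    /* base 0: hex/octal/decimal inferred from the prefix, like the CAPL strtol */
    long n = strtol("0xFF", NULL, 0);
    printf("%ld\n", n); /* prints 255 */
    return 0;
}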

Write a unicode CString to a file using WriteFile API

How can I write contents of a CString instance to a file opened by CreateFile using WriteFile Win32 API function?
Please note MFC is not used, and CString is used by including "atlstr.h"
edit: Can I just do
WriteFile(handle, cstr, cstr.GetLength(), &dwWritten, NULL);
or
WriteFile(handle, cstr, cstr.GetLength() * sizeof(TCHAR), &dwWritten, NULL);
?
With ATL it's like this:
CString sValue;
CStringW sValueW(sValue); // NOTE: CT2CW() can be used instead
CAtlFile File;
ATLENSURE_SUCCEEDED(File.Create(sPath, GENERIC_WRITE, ...));
static const BYTE g_pnByteOrderMark[] = { 0xFF, 0xFE }; // UTF-16, Little Endian
ATLENSURE_SUCCEEDED(File.Write(g_pnByteOrderMark, sizeof g_pnByteOrderMark));
ATLENSURE_SUCCEEDED(File.Write(sValueW, (DWORD) (sValueW.GetLength() * sizeof (WCHAR))));
The leading bytes are a byte order mark (BOM), which lets Notepad know that the encoding is UTF-16.
You need to pick a text encoding, convert the string to that encoding, and write the text file accordingly. Simplest is ANSI, so basically you would use the T2CA macro to convert your TCHAR string to ANSI, then dump the contents into a file. Something like this (untested):
// assumes handle is already opened on an empty new file
void DumpToANSIFile(const CString& str, HANDLE hFile)
{
    USES_CONVERSION;
    PCSTR ansi = T2CA(str);
    DWORD dwWritten;
    WriteFile(hFile, ansi, strlen(ansi) * sizeof(ansi[0]), &dwWritten, NULL);
}
Because it's ANSI encoding though, it will only be readable on computers that have your same code page settings. For a more portable solution, use UTF-8 or UTF-16.
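A UTF-8 variant of the same idea could look like the following sketch (CP_UTF8 with WideCharToMultiByte; the function name is illustrative and error checks are omitted):

#include <atlstr.h>

// Write a CStringW to an already-opened file handle as UTF-8.
void DumpToUTF8File(const CStringW& str, HANDLE hFile)
{
    // measure, then convert UTF-16 -> UTF-8
    int len = WideCharToMultiByte(CP_UTF8, 0, str, str.GetLength(), NULL, 0, NULL, NULL);
    CStringA utf8;
    WideCharToMultiByte(CP_UTF8, 0, str, str.GetLength(), utf8.GetBuffer(len), len, NULL, NULL);
    utf8.ReleaseBuffer(len);

    DWORD dwWritten;
    static const BYTE bom[] = { 0xEF, 0xBB, 0xBF }; // optional UTF-8 BOM so Notepad detects the encoding
    WriteFile(hFile, bom, sizeof bom, &dwWritten, NULL);
    WriteFile(hFile, (LPCSTR)utf8, len, &dwWritten, NULL);
}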
Converting to ANSI can cause problems with code pages, so it is not acceptable in many cases. Here is a function that saves a Unicode string to a Unicode text file:
void WriteUnicodeStringToFile(const CString& str, LPCWSTR FileName)
{
    HANDLE f = CreateFileW(FileName, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return; //failed
    DWORD wr;
    unsigned char Header[2]; //unicode text file header (UTF-16 LE BOM)
    Header[0] = 0xFF;
    Header[1] = 0xFE;
    WriteFile(f, Header, 2, &wr, NULL);
    WriteFile(f, (LPCWSTR)str, str.GetLength() * 2, &wr, NULL);
    CloseHandle(f);
}
Using:
CString str = L"This is a sample unicode string";
WriteUnicodeStringToFile(str, L"c:\\Sample.txt");
Notepad understands unicode text files.
I believe you need additional casting:
WriteFile(handle, (LPCVOID)(LPCTSTR)cstr, cstr.GetLength() * sizeof(TCHAR), &dwWritten, NULL);
Also make sure that you write proper byte order mark in the beginning so that Notepad can read it properly.

WideCharToMultiByte when is lpUsedDefaultChar true?

I am trying to understand WideCharToMultiByte, and I was wondering when lpUsedDefaultChar would be set to TRUE.
Here is a sample: what should lpszW be in order for the flag to be set to TRUE?
lpszW = L"__WHAT SHOULD_BE_HERE__";
int c = ??;
BOOL fUsedDefaultChar = false;
DWORD dwSize = WideCharToMultiByte(CP_ACP, 0, lpszW, c, myOutStr ,myOutLen, NULL, &fUsedDefaultChar);
http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx
Any books/tutorials for understanding Unicode/UTF stuff would be great.
Thanks!
Anything that is not present in the current codepage will map to ? (by default) and UsedDefaultChar will be != FALSE.
Windows-1252 is probably the most common codepage and most of those characters map to the same value in unicode.
Take Ω (ohm) for example, it is probably not present in whatever your current codepage is and therefore will not map to a valid narrow character:
BOOL fUsedDefaultChar=FALSE;
DWORD dwSize;
char myOutStr[MAX_PATH];
WCHAR lpszW[10]=L"Hello";
*lpszW=0x2126; //ohm sign, you could also use the \u2126 syntax if your compiler supports it.
dwSize = WideCharToMultiByte(CP_ACP, 0, lpszW, -1, myOutStr ,MAX_PATH, NULL, &fUsedDefaultChar);
printf("%d %s\n",fUsedDefaultChar,myOutStr); //This prints "1 ?ello" on my system
The MSDN documentation is very clear about when lpUsedDefaultChar is set to TRUE:
lpDefaultChar [in] Optional. Pointer to the character to use if a character cannot be represented in the specified code page. The application sets this parameter to NULL if the function is to use a system default value. To obtain the system default character, the application can call the GetCPInfo or GetCPInfoEx function.

lpUsedDefaultChar [out] Optional. Pointer to a flag that indicates if the function has used a default character in the conversion. The flag is set to TRUE if one or more characters in the source string cannot be represented in the specified code page. Otherwise, the flag is set to FALSE. This parameter can be set to NULL.
That does not leave much room for misunderstanding, in my opinion.

Converting LPCWSTR with WideCharToMultiByte. Need help

I have a function like this:
BOOL WINAPI MyFunction(HDC hdc, LPCWSTR text, UINT cbCount)
{
    char AnsiBuffer[255];
    int written = WideCharToMultiByte(CP_ACP, 0, text, cbCount, AnsiBuffer, 0, NULL, NULL);
    if (written > -1) AnsiBuffer[written] = '\0';
    if (written > 0)
    {
        ofstream myfile;
        myfile.open("C:\\example.txt", ios::app);
        myfile.write(AnsiBuffer, sizeof(AnsiBuffer));
        myfile.write("\n", 1);
        myfile.close();
    }
    ....
When I display the input LPCWSTR text with MessageBoxW(), the text shows up fine. When I try to convert it to multibyte, the return value looks normal (e.g. 22, 45, etc.), but the result is a string of gibberish (e.g. ÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌ). Suggestions?
I see two problems:
1) You are passing 0 to WideCharToMultiByte for the size of the multibyte buffer. If you read the documentation, this results in the function returning the NUMBER of bytes needed but performing no actual conversion. (This is to allow you to subsequently allocate a buffer of the correct size and call the function again.)
2) In myfile.write, sizeof(AnsiBuffer) will result in 255 bytes being written regardless of what is in the buffer. sizeof is a compile-time calculation that returns the size of a variable. You should replace it with the 'written' variable that holds the length of the converted string.
You need to pass the length of the buffer to the API, instead of passing 0. When you pass 0, the function returns the required length of the buffer, but doesn't write to it. You're seeing the results of the uninitialized array.
Here's the right call, with the 255 in the right place:
int written = WideCharToMultiByte(CP_ACP, 0, text, cbCount, AnsiBuffer , 255, NULL, NULL);
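Putting both fixes together (a real buffer size passed to the API, and 'written' used as the write length), a corrected sketch of the function body could read as follows (DumpText is an illustrative name):

#include <windows.h>
#include <fstream>

void DumpText(LPCWSTR text, UINT cbCount)
{
    char AnsiBuffer[255];
    // pass the real buffer size so the conversion actually happens
    int written = WideCharToMultiByte(CP_ACP, 0, text, cbCount, AnsiBuffer, sizeof(AnsiBuffer), NULL, NULL);
    if (written > 0)
    {
        std::ofstream myfile("C:\\example.txt", std::ios::app);
        myfile.write(AnsiBuffer, written); // only the converted bytes, not sizeof(AnsiBuffer)
        myfile.write("\n", 1);
    }
}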
