UTF-8 encoding of characters bigger than the UTF-8 upper range - utf-8

I'm working on a translation of UTF-8 encoding code from C# into C.
UTF-8 covers the range of character values from 0x0000 to 0x7FFFFFFF (http://en.wikipedia.org/wiki/UTF-8).
The encoding function in the C# file encodes, for example, the character 'ñ' without problems.
This character 'ñ' has the hex value 0xFFFFFFF1 in my sample program when I look at it in the memory window in VS 2005.
But the character 'ñ' in the Windows symbol table has the hex value 0xF1.
Now, in my sample program, I check the characters in the string and find the highest range of UTF-8 to determine which UTF-8 encoding range should be used for encoding.
Like this ("charToAnalyse" is here a character of the string):
{
    char utfMode = 0;
    char utf8EncoderMode = 0;

    if (charToAnalyse >= 0x0000 && charToAnalyse <= 0x007F)
        { utfMode = 1; }
    else if (charToAnalyse >= 0x0080 && charToAnalyse <= 0x07FF)
        { utfMode = 2; }
    else if (charToAnalyse >= 0x0800 && charToAnalyse <= 0xFFFF)
        { utfMode = 3; }
    else if (charToAnalyse >= 0x10000 && charToAnalyse <= 0x1FFFFF)
        { utfMode = 4; }
    else if (charToAnalyse >= 0x200000 && charToAnalyse <= 0x3FFFFFF)
        { utfMode = 5; }
    else if (charToAnalyse >= 0x4000000 && charToAnalyse <= 0x7FFFFFFF)
        { utfMode = 6; }
    ...
    ...
    ...
    if (utfMode > utf8EncoderMode)
    {
        utf8EncoderMode = utfMode;
    }
In this function utfMode stays 0 for the character 'ñ', because ñ == 0xFFFFFFF1 and cannot be classified by the ranges above.
MY QUESTIONS HERE ARE:
1) Is it true that ñ has the value 0xFFFFFFF1? If yes, how can it be classified for UTF-8 encoding? Is it possible for a character to have a value bigger than U+7FFFFFFF (0x7FFFFFFF)?
2) Is this somehow related to the terms "low surrogate" and "high surrogate"?
Thanks a lot, even if it's an absurd question :)

It sounds very much as though you're reading signed bytes (is your input in ISO 8859-1, perchance?): your bytes are being interpreted as being in the range -128..127 rather than 0..255, and the value that should be 0xF1 (241) is being read as -15 instead, which is 0xFFFFFFF1 in two's complement. In C, "char" is often signed by default[1]; you should be using "unsigned char".
Unicode does not go as far up as 0xfffffff1, which is why UTF-8 does not provide an encoding for such code points.
[1] To be precise, "char" is distinct from both "signed char" and "unsigned char". But it can behave as either unsigned or signed, and which you get is implementation-defined.
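A minimal sketch of the sign extension described above (my own illustration; it assumes a platform where plain char is signed, which is typical for x86 compilers):

#include <stdio.h>

int main(void)
{
    char          sc = (char)0xF1;          /* plain char: typically signed, holds -15 */
    unsigned char uc = (unsigned char)0xF1; /* always 0..255, holds 241                */

    /* When promoted to int, the signed char is sign-extended. */
    printf("signed:   %d (0x%08X)\n", sc, (unsigned)sc); /* -15 (0xFFFFFFF1) */
    printf("unsigned: %d (0x%08X)\n", uc, (unsigned)uc); /* 241 (0x000000F1) */
    return 0;
}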

I would like to explain this issue, but Joni was first :)
@Joni: You are perfectly right.
As I initialize the integer array as:
int charToAnalyseStr[50] = {'a', 0x7FFFFFFF, 'ñ', 'ş', 1};
the initialization of, e.g., the third member 'ñ' happens as follows:
The member given as 'ñ' is understood by the system as a signed char (1 byte).
'ñ' has a value of -15 as a signed char; this equals 241 as an unsigned char!
So the value -15 is stored as an element of the array by the initialization.
The value -15, converted to a signed integer, naturally becomes 0xFFFFFFF1 (hex).
The solution I found is:
int charToAnalyseStr[50] = {(unsigned char)'a', 0x7FFFFFFF, (unsigned char)'ñ', 1};
So charToAnalyseStr[2] appears in the memory window as 0x000000F1 :)
Thanks for your brainstorming!
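For what it's worth, here is a small sketch (my own illustration, not the original program) of how the classification from the question behaves once the character is stored as 0x000000F1 instead of 0xFFFFFFF1:

#include <stdio.h>

/* Same ranges as the classification in the question, reduced to a function. */
static int utf8_mode(unsigned int c)
{
    if (c <= 0x007F)     return 1;
    if (c <= 0x07FF)     return 2;
    if (c <= 0xFFFF)     return 3;
    if (c <= 0x1FFFFF)   return 4;
    if (c <= 0x3FFFFFF)  return 5;
    if (c <= 0x7FFFFFFF) return 6;
    return 0;  /* outside the original UTF-8 range */
}

int main(void)
{
    int wrong = (char)0xF1;          /* sign-extended: 0xFFFFFFF1 */
    int right = (unsigned char)0xF1; /* 0x000000F1, i.e. U+00F1   */

    printf("wrong: mode %d\n", utf8_mode((unsigned int)wrong)); /* 0 */
    printf("right: mode %d\n", utf8_mode((unsigned int)right)); /* 2 */
    return 0;
}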

Related

C Program Strange Characters retrieved due to language setting on Windows

If the code below is compiled with UNICODE as a compiler option, the GetComputerNameEx API returns junk characters.
Whereas if compiled without the UNICODE option, the API returns a truncated value of the hostname.
This issue is mostly seen with Asia-Pacific languages like Chinese, Japanese and Korean, to name a few (i.e., non-English).
Can anyone throw some light on how this issue can be resolved?
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define INFO_SIZE 30

int main()
{
    int ret;
    TCHAR infoBuf[INFO_SIZE + 1];
    DWORD bufSize = (INFO_SIZE + 1);
    char *buf;
    buf = (char *) malloc(INFO_SIZE + 1);
    if (!GetComputerNameEx((COMPUTER_NAME_FORMAT)1,
                           (LPTSTR)infoBuf, &bufSize))
    {
        printf("GetComputerNameEx failed (%d)\n", GetLastError());
        return -1;
    }
    ret = wcstombs(buf, infoBuf, (INFO_SIZE + 1));
    buf[INFO_SIZE] = '\0';
    return 0;
}
In the languages you mentioned, most characters are represented by more than one byte. This is because these languages have alphabets of much more than 256 characters. So you may need more than 30 bytes to encode 30 characters.
The usual pattern for calling a function like wcstombs goes like this: first get the amount of bytes required, then allocate a buffer, then convert the string.
(edit: that actually relies on a POSIX extension, which also got implemented on Windows)
size_t size = wcstombs(NULL, infoBuf, 0);
if (size == (size_t) -1) {
    // some character can't be converted
}
char *buf = malloc(size + 1);  // use new char[size + 1] in C++
size = wcstombs(buf, infoBuf, size + 1);
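An untested sketch of how that two-pass pattern could look wired into the program from the question; it assumes the wide-character API GetComputerNameExW is used directly and that the locale is set so wcstombs can produce the system's multibyte encoding:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    WCHAR infoBuf[256];
    DWORD bufSize = 256;   /* size of the buffer in WCHARs, including the terminator */

    setlocale(LC_ALL, ""); /* use the system locale for wcstombs */

    if (!GetComputerNameExW(ComputerNameDnsHostname, infoBuf, &bufSize))
    {
        printf("GetComputerNameExW failed (%lu)\n", GetLastError());
        return -1;
    }

    /* First pass: ask how many bytes the converted string needs. */
    size_t needed = wcstombs(NULL, infoBuf, 0);
    if (needed == (size_t)-1)
    {
        printf("The name contains characters the current locale cannot represent\n");
        return -1;
    }

    /* Second pass: allocate exactly enough and convert. */
    char *buf = malloc(needed + 1);
    wcstombs(buf, infoBuf, needed + 1);
    printf("%s\n", buf);
    free(buf);
    return 0;
}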

Stored hex values in a Notepad file with .ini extension: how to read them as hex via CAPL

I have stored hex values, along with addresses, in a text file with a .ini extension. But when I read it, the values are not in hex format, they come back as characters. Is there any way to read a value as hex and store it in a byte, in C or in a CAPL script?
I assume that you know how to read a text file in CAPL...
You can convert a hex string to a number using strtol(char s[], long& result):long. See the CAPL help (CAPL Function Overview -> General -> strtol):
The number base is
hexadecimal if the string starts with "0x"
octal if the string starts with "0"
decimal otherwise
Whitespace (spaces or tabs) at the start of the string is ignored.
Example:
on start
{
    long number1, number2;
    strtol("0xFF", number1);
    strtol("-128", number2);
    write("number1 = %d", number1);
    write("number2 = %d", number2);
}
Output:
number1 = 255
number2 = -128
See also: strtoll(), strtoul(), strtoull(), strtod() and atol()
Update:
If the hex string does not start with "0x"...
on message 0x200
{
    if (this.byte(0) == hextol("38"))
        write("byte(0) == 56");
}

long hextol(char s[])
{
    long res;
    char xs[8];
    strncpy(xs, "0x", elcount(xs)); // copy "0x" to 'xs'
    strncat(xs, s, elcount(xs));    // concatenate 'xs' and 's'
    strtol(xs, res);                // convert to long
    return res;
}

Parsing float from i2c with ruby on raspberry pi

This is mostly a Ruby question.
I'm stuck trying to parse some bytes from an I2C device into a float value in Ruby.
Long story:
I'm trying to read a float value from an I2C device (an ATtiny85) with a Raspberry Pi. I'm able to read the value from the console through i2c-tools.
Example:
i2cset -y 0 0x25 0x00; sleep 1; i2cget -y 0 0x25 0x00; i2cget -y 0 0x25 0x00; i2cget -y 0 0x25 0x00; i2cget -y 0 0x25 0x00
Gives me:
0x3e
0x00
0x80
0x92
That means 0.12549046, which is a value in volts that I'm able to check with my multimeter and is OK. (The order of the bytes is 0x3e008092.)
Now I need to get this float value from a Ruby script; I'm using the i2c gem.
A comment on this site suggests the following conversion method:
hex_string = '42880000'
float = [hex_string.to_i(16)].pack('L').unpack('F')[0]
# => 68.0
float = 66.2
hex_string = [float].pack('F').unpack('L')[0].to_s(16)
# => 42846666
But I haven't been able to get this string of hex values. This fragment of code:
require "i2c/i2c"
require "i2c/backends/i2c-dev"
#i2c = ::I2C.create("/dev/i2c-0")
sharp = 0x25
#i2c.write(sharp, 0)
sleep 1
puts #i2c.read(sharp, 4).inspect
prints on screen:
">\x00\x00P"
where the characters '>' and 'P' are the ASCII representations of the bytes in those positions, but I cannot figure out where/how to split the string and clean it up to at least try the method shown above.
I could write a C program to read the value and printf it to the console or something and run it from Ruby, but I think that would be an awful solution.
Some ideas on how this could be done would be very helpful!
Greetings.
I came up with something:
bytes = []
for i in (0..3) do
    bytes << @i2c.read_byte(sharp).unpack('*C*')[0].to_s(16)
    bytes[i] = "00" unless bytes[i] != "0"
end
bytes = bytes.join.to_s
float = [bytes.to_i(16)].pack('L').unpack('F')[0]
puts float.to_s
Not sure about unpack('*C*') though, but it works. If there's a better way to do it, I'd be glad to accept another answer.
Greetings!
You probably just need to use unpack with a format of g, or possibly e depending on the endianness.
@i2c.read(sharp, 4).unpack('g')[0]
The example you are referring to takes a string of hex digits, first converts it to an integer and then to a binary string (that's the [hex_string.to_i(16)].pack('L') part; the L directive is for 32-bit integers). The data you have is already a binary string, so you just need to convert it directly with the appropriate directive for unpack.
Have a read of the documentation for unpack and pack.

How to determine if a character is a Chinese character

How to determine if a character is a Chinese character using Ruby?
Ruby 1.9
#encoding: utf-8
"漢" =~ /\p{Han}/
An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series - check the table of contents at the start of the article also)
I haven't used Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs. Also take note that it's a unified system including Japanese and Korean characters (some characters are shared between them) - not sure if you can distinguish which are Chinese only.
I think you can check if it's a CJK character by calling this on string str and character with index n:
def check_char(str, n)
  list_of_chars = str.unpack("U*")
  char = list_of_chars[n]

  # main blocks
  if char >= 0x4E00 && char <= 0x9FFF
    return true
  end
  # extended block A
  if char >= 0x3400 && char <= 0x4DBF
    return true
  end
  # extended block B
  if char >= 0x20000 && char <= 0x2A6DF
    return true
  end
  # extended block C
  if char >= 0x2A700 && char <= 0x2B73F
    return true
  end
  return false
end

Converting Decimal to ASCII Character

I am trying to convert a decimal number to its character equivalent. For example:
int j = 65; // The character equivalent would be 'A'.
Sorry, forgot to specify the language. I thought I did. I am using Cocoa/Objective-C. It is really frustrating. I have tried the following but it is still not converting correctly.
char_num1 = [working_text characterAtIndex:i];   // value = 65
char_num2 = [working_text characterAtIndex:i+1]; // value = 75
char_num3 = char_num1 + char_num2;               // value = 140
char_str1 = [NSString stringWithFormat:@"%c", char_num3]; // mapped value = 229
char_str2 = [char_str2 stringByAppendingString:char_str1];
When char_num1 and char_num2 are added, I get a new ASCII decimal value. However, when I try to convert the new decimal value to a character, I do not get the character that is mapped to char_num3.
Convert a character to a number in C:
int j = 'A';
Convert a number to a character in C:
char ch = 65;
Convert a character to a number in python:
j = ord('A')
Convert a number to a character in Python:
ch = chr(65)
Most languages have a 'char' function, so it would be Char(j)
I'm not sure what language you're asking about. In Java, this works:
int a = 'a';
It's quite often done with "chr" or "char", but some indication of the language / platform would be useful :-)
string k = Chr(j);
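As a side note on the snippet in the question: adding the two codes (65 + 75 = 140) produces a number that is not the code of either character. A small C sketch (my own illustration, using the values from the question) of the difference between adding codes and appending characters:

#include <stdio.h>

int main(void)
{
    int first = 65;   /* 'A' */
    int second = 75;  /* 'K' */

    /* Adding the codes just gives 140, an unrelated number. */
    printf("sum: %d\n", first + second);

    /* To combine the characters, build a string from them instead. */
    char str[3] = { (char)first, (char)second, '\0' };
    printf("appended: %s\n", str);   /* prints "AK" */
    return 0;
}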
