How to determine if a character is a Chinese character

How can I determine whether a character is a Chinese character using Ruby?

Ruby 1.9:
# encoding: utf-8
"漢" =~ /\p{Han}/   # => 0 (index of the match); nil when the string has no Han character

An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series; check the table of contents at the start of the article as well).
I haven't worked with Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Note that it is a unified set that also covers Japanese and Korean usage (many characters are shared between the languages), so I'm not sure you can distinguish which are Chinese only.
I think you can check whether the character at index n of string str is a CJK ideograph like this:
def check_char(str, n)
  list_of_chars = str.unpack("U*")
  char = list_of_chars[n]
  # main block
  if char >= 0x4E00 && char <= 0x9FFF
    return true
  end
  # extended block A
  if char >= 0x3400 && char <= 0x4DBF
    return true
  end
  # extended block B
  if char >= 0x20000 && char <= 0x2A6DF
    return true
  end
  # extended block C
  if char >= 0x2A700 && char <= 0x2B73F
    return true
  end
  return false
end
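For comparison, here is a minimal sketch that puts the two approaches from this thread side by side: the \p{Han} script property and an explicit code-point range check. The helper names (HAN_RANGES, han_by_property?, han_by_range?) are made up for illustration, and String#match? assumes Ruby 2.4 or later. The two checks can disagree, because \p{Han} also covers Han characters that live outside these four blocks.

```ruby
# Hypothetical helpers; only \p{Han}, String#match?, String#ord and Range#cover? are standard Ruby.
HAN_RANGES = [
  0x4E00..0x9FFF,    # CJK Unified Ideographs (main block)
  0x3400..0x4DBF,    # Extension A
  0x20000..0x2A6DF,  # Extension B
  0x2A700..0x2B73F   # Extension C
].freeze

# Script-property check: true if the one-character string is in the Han script.
def han_by_property?(char)
  char.match?(/\p{Han}/)
end

# Range check: true if the character's code point falls in one of the listed blocks.
def han_by_range?(char)
  code = char.ord
  HAN_RANGES.any? { |range| range.cover?(code) }
end

%w[漢 a か 한].each do |ch|
  puts "#{ch}: property=#{han_by_property?(ch)} range=#{han_by_range?(ch)}"
end
```

Neither check can tell Chinese from Japanese or Korean usage of the same ideograph, for the reason given above.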

Related

Golang string end character

I'm a newbie in Golang, so I'm playing with some algorithms, and I have a little problem.
In Java, to insert a string terminator into a char array I can do this:
String str = "Mr John Smith ";
char[] arr = str.toCharArray();
arr[12] = '\0';
But in Golang I'm trying it like this:
str := []byte("Mr John Smith ")
str[12] = '\0'
But this code doesn't work.

That's not valid syntax for a rune literal with a 0 value. You can use the hex escape sequence:
str[12] = '\x00'
If you really need an octal value, it requires 3 digits:
str[12] = '\000'
Or just assign a literal 0:
str[12] = 0
You can see the valid rune literal escape sequences in the specification: https://golang.org/ref/spec#Rune_literals

Replace decimal numbers in string by hex equivalent

I have a huge set of strings like:
// Register 10:
typedef struct RegAddr_10
{
    uint8 something: 4;
    uint8 something_else: 4;
} tRegAddr_10;
and want to convert all register addresses (given in decimal numbers) to hexadecimal. Other numbers can occur within each typedef; therefore I have to consider the Reg part as a kind of delimiter. The example should result in:
// Register 0x0A:
typedef struct RegAddr_0x0A
{
    uint8 something: 4;
    uint8 something_else: 4;
} tRegAddr_0x0A;
My solution is this:
class String
  def convert_base(from, to)
    self.to_i(from).to_s(to)
  end
end
new = text.gsub(/Reg\D*\d*/) do |number|
  number.gsub(/(\d+)/) {'0x'+$1.convert_base(10,16)}
end
It works, but:
(How) is it possible to do this with one gsub only?
How can I make the conversion generate 2-digit-hex numbers in upper case, e.g. 10 → 0x0A, not 0xa?

Code
R = /
    (?:                    # begin non-capture group
      ^                    # match beginning of line
      \/{2}\s+Register\s+  # match string
      |                    # or
      \s+RegAddr_          # match string
      |                    # or
      \s+tRegAddr_         # match string
    )                      # close non-capture group
    \K                     # discard everything matched so far
    \d+                    # match >= 1 digits
    /x                     # extended mode
def replace_with_hex(str)
  str.gsub(R) { |s| "0x%02X" % s }
end
The part of the format string for String#% that follows the percent character is: 0, meaning pad left with zeros; 2 for the minimum field width; and X to convert to hex with the letters A-F in capitals (x for lower case).
Example
str = <<_
// Register 10:
typedef struct RegAddr_10
{
uint8 something: 4;
uint8 something_else: 4;
} tRegAddr_10;
_
puts replace_with_hex(str)
prints:
// Register 0x0A:
typedef struct RegAddr_0x0A
{
uint8 something: 4;
uint8 something_else: 4;
} tRegAddr_0x0A;
Alternatives
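If you are less fussy:
R = /
    [\st]   # match whitespace or t
    Reg\D+  # match string
    \K      # discard everything matched so far
    \d+     # match >= 1 digits
    /x      # extended mode
works as well.
You could also write the operative line of replace_with_hex using the last-match global inside the block:
str.gsub(R) { "0x%02X" % $~[0] }
Note that the non-block form str.gsub(R, "0x%02X" % $~[0]) does not work: the replacement argument is evaluated once, before gsub performs any match.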
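As a compact illustration of the single-gsub idea, here is a minimal sketch that assumes the "less fussy" pattern is good enough for the input. The names REG_NUMBER and hexify_registers are hypothetical. The block form is used so that the conversion runs once per match, and the matched digits are parsed as decimal with to_i before formatting.

```ruby
# Hypothetical constant and method names; the pattern follows the "less fussy" alternative.
REG_NUMBER = /[\st]Reg\D+\K\d+/

def hexify_registers(text)
  # The block is called for every match; `digits` is the decimal number as a string.
  text.gsub(REG_NUMBER) { |digits| format("0x%02X", digits.to_i) }
end

sample = "// Register 10:\ntypedef struct RegAddr_255\n} tRegAddr_4096;\n"
puts hexify_registers(sample)
# prints:
# // Register 0x0A:
# typedef struct RegAddr_0xFF
# } tRegAddr_0x1000;
```

The width in %02X is a minimum, so values above 255 simply produce more digits instead of being truncated.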

UTF-8 encoding of characters bigger than the UTF-8 upper range

I'm working on translating UTF-8 encoding code from C# into C.
UTF-8 covers the range of character values from 0x0000 to 0x7FFFFFFF (http://en.wikipedia.org/wiki/UTF-8).
The encoding function in the C# file encodes, for example, the character 'ñ' without problems.
In my sample program this character 'ñ' has the value FFFFFFF1 in hex when I look at it in the memory window in VS 2005.
But in the Windows symbol table the character 'ñ' has the hex value 0xF1.
Now, in my sample program, I check the characters in the string and find the highest UTF-8 range needed, to determine which UTF-8 encoding range should be used for encoding.
For example ("charToAnalyse" here is a character of the string):
{
    char utfMode = 0;
    char utf8EncoderMode = 0;
    if(charToAnalyse >= 0x0000 && charToAnalyse <= 0x007F)
    {utfMode =1;}
    else if(charToAnalyse >= 0x0080 && charToAnalyse <= 0x07FF)
    {utfMode =2;}
    else if(charToAnalyse >= 0x0800 && charToAnalyse <= 0xFFFF)
    {utfMode =3;}
    else if(charToAnalyse >= 0x10000 && charToAnalyse <= 0x1FFFFF)
    {utfMode =4;}
    else if(charToAnalyse >= 0x200000 && charToAnalyse <= 0x3FFFFFF)
    {utfMode =5;}
    else if(charToAnalyse >= 0x4000000 && charToAnalyse <= 0x7FFFFFFF)
    {utfMode =6;}
    ...
    ...
    ...
    if(utfMode > utf8EncoderMode)
    {
        utf8EncoderMode = utfMode;
    }
In this function utfMode=0 for the character 'ñ', because ñ == 0xFFFFFFF1, which cannot be classified by the code above.
MY QUESTION HERE IS:
1) Is it true that ñ has the value 0xFFFFFFF1? If yes, how can it be classified for UTF-8 encoding? Is it possible for a character to have a value bigger than U+7FFFFFFF (0x7FFFFFFF)?
2) Is this somehow related to the terms "low surrogate" or "high surrogate"?
Thanks a lot, even if it's an absurd question :)

It sounds very much as though you're reading signed bytes (is your input in ISO 8859-1 perchance?): your bytes are being interpreted as being in the range -128..127 rather than 0..255, and your value that should be 0xf1 (241) is being read as -15 instead, which is 0xfffffff1 in two's complement. In C, "char" is often signed by default[1]; you should be using "unsigned char".
Unicode does not go as far up as 0xfffffff1 (its highest code point is U+10FFFF), which is why UTF-8 does not provide an encoding for such code points.
[1] To be precise, "char" is distinct from both "signed char" and "unsigned char". But it can behave as either unsigned or signed, and which you get is implementation-defined.
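The effect described in this answer can be reproduced outside C. The following is only an illustrative sketch in Ruby (the helper names are hypothetical): it reads the same raw byte 0xF1 once as a signed and once as an unsigned 8-bit value, and shows the 32-bit pattern each one widens to. String#unpack1 assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: the 32-bit pattern you get when the byte is treated as a signed char.
def widen_as_signed_char(byte_string)
  signed = byte_string.unpack1("c")  # signed 8-bit interpretation, -15 for 0xF1
  signed & 0xFFFFFFFF                # bit pattern of the sign-extended 32-bit value
end

# Hypothetical helper: the value you get when the byte is treated as an unsigned char.
def widen_as_unsigned_char(byte_string)
  byte_string.unpack1("C")           # unsigned 8-bit interpretation, 241 for 0xF1
end

byte = "\xF1".b  # one raw byte; this is 'ñ' in ISO 8859-1
printf("signed:   0x%08X\n", widen_as_signed_char(byte))    # prints "signed:   0xFFFFFFF1"
printf("unsigned: 0x%08X\n", widen_as_unsigned_char(byte))  # prints "unsigned: 0x000000F1"

# For reference, code point U+00F1 falls in the two-byte UTF-8 range:
p [0xF1].pack("U").bytes.map { |b| format("0x%02X", b) }    # prints ["0xC3", "0xB1"]
```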

I wanted to explain this issue, but Joni got there first :)
@Joni: You are perfectly right.
I initialize the integer string as:
int charToAnalyseStr[50]= {'a', 0x7FFFFFFF, 'ñ', 'ş', 1};
The initialization of, for example, the third member ñ happens as follows:
The member given as 'ñ' is understood by the system as a signed char (1 byte).
'ñ' has the value (-15) as a signed char, which equals 241 as an unsigned char!
So the value (-15) is what gets stored as the element of the string during initialization.
The value (-15) is converted to a signed integer as 0 (dec) - 15 (dec) = 0xFFFFFFF1 (hex).
The solution I found is:
int charToAnalyseStr[50]= {(unsigned char)'a', 0x7FFFFFFF, (unsigned char)'ñ', 1};
So charToAnalyseStr[2] appears in the memory window as 0x000000F1 :)
Thanks for the brainstorming!

Converting Decimal to ASCII Character

I am trying to convert a decimal number to its character equivalent. For example:
int j = 65; // The character equivalent would be 'A'.
Sorry, forgot to specify the language. I thought I did. I am using Cocoa/Objective-C. It is really frustrating. I have tried the following but it is still not converting correctly.
char_num1 = [working_text characterAtIndex:i]; // value = 65
char_num2 = [working_text characterAtIndex:i+1]; // value = 75
char_num3 = char_num1 + char_num2; // value = 140
char_str1 = [NSString stringWithFormat:@"%c",char_num3]; // mapped value = 229
char_str2 = [char_str2 stringByAppendingString:char_str1];
When char_num1 and char_num2 are added, I get the new ASCII decimal value. However, when I try to convert the new decimal value to a character, I do not get the character that is mapped to char_num3.

Convert a character to a number in C:
int j = 'A';
Convert a number to a character in C:
char ch = 65;
Convert a character to a number in Python:
j = ord('A')
Convert a number to a character in Python:
ch = chr(65)

Most languages have a 'char' function, so it would be Char(j)

I'm not sure what language you're asking about. In Java, this works:
int a = 'a';

It's quite often done with "chr" or "char", but some indication of the language / platform would be useful :-)
string k = Chr(j);
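For one more data point, here is the same round trip sketched in Ruby, which is not the asker's language and is shown only for comparison. It also checks the sum from the question: 65 + 75 = 140 lies outside the 7-bit ASCII range (0..127), so which glyph it maps to depends on the encoding in use.

```ruby
code = 65
char = code.chr          # Integer#chr: code -> one-character string
puts char                # prints "A"
puts char.ord            # String#ord: character -> code, prints 65

sum = "A".ord + "K".ord  # 65 + 75 = 140
puts sum                 # prints 140
puts sum < 128           # prints false: 140 is not an ASCII code
```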

Code Golf: Email Address Validation without Regular Expressions

(Edit: What is Code Golf? Code golf challenges are about solving a specific problem with the shortest amount of code, by character count, in whichever language you prefer. More info is available on Meta Stack Overflow.)
Code Golfers, here's a challenge on string operations.
Email Address Validation, but without regular expressions (or similar parsing library) of course. It's not so much about the email addresses but how short you can write the different string operations and constraints given below.
The rules are the following (yes, I know, this is not RFC compliant, but these are going to be the 5 rules for this challenge):
At least 1 character out of this group before the @:
A-Z, a-z, 0-9, . (period), _ (underscore)
@ has to exist, exactly one time
john@smith.com
    ^
Period (.) has to exist exactly one time after the @
john@smith.com
          ^
At least 1 only [A-Z, a-z] character between @ and the following . (period)
john@s.com
     ^
At least 2 only [A-Z, a-z] characters after the final . period
john@smith.ab
           ^^
Please post the method/function only, which would take a string (proposed email address) and then return a Boolean result (true/false) depending on the email address being valid (true) or invalid (false).
Samples:
b@w.org     (valid/true)       @w.org      (invalid/false)
b@c@d.org   (invalid/false)    test@org    (invalid/false)
test@%.org  (invalid/false)    s%p@m.org   (invalid/false)
j_r@x.c.il  (invalid/false)    j_r@x.mil   (valid/true)
r..t@x.tw   (valid/true)       foo@a%.com  (invalid/false)
Good luck!

C89 (166 characters)
#define B(c)isalnum(c)|c==46|c==95
#define C(x)if(!v|*i++-x)return!1;
#define D(x)for(v=0;x(*i);++i)++v;
v;e(char*i){D(B)C(64)D(isalpha)C(46)D(isalpha)return!*i&v>1;}
Not re-entrant, but can be run multiple times. Test bed:
#include<stdio.h>
#include<assert.h>
main(){
    assert(e("b@w.org"));
    assert(e("r..t@x.tw"));
    assert(e("j_r@x.mil"));
    assert(!e("b@c@d.org"));
    assert(!e("test@%.org"));
    assert(!e("j_r@x.c.il"));
    assert(!e("@w.org"));
    assert(!e("test@org"));
    assert(!e("s%p@m.org"));
    assert(!e("foo@a%.com"));
    puts("success!");
}

J
:[[/%^(:[[+-/^,&i|:[$[' ']^j+0__:k<3:]]

C89, 175 characters.
#define G &&*((a+=t+1)-1)==
#define H (t=strspn(a,A
t;e(char*a){char A[66]="_.0123456789Aa";short*s=A+12;for(;++s<A+64;)*s=s[-1]+257;return H))G 64&&H+12))G 46&&H+12))>1 G 0;}
I am using the standard library function strspn(), so I feel this answer isn't as "clean" as strager's answer which does without any library functions. (I also stole his idea of declaring a global variable without a type!)
One of the tricks here is that by putting . and _ at the start of the string A, it's possible to include or exclude them easily in a strspn() test: when you want to allow them, use strspn(something, A); when you don't, use strspn(something, A+12). Another is assuming that sizeof (short) == 2 * sizeof (char), and building up the array of valid characters 2 at a time from the "seed" pair Aa. The rest was just looking for a way to force subexpressions to look similar enough that they could be pulled out into #defined macros.
To make this code more "portable" (heh :-P) you can change the array-building code from
char A[66]="_.0123456789Aa";short*s=A+12;for(;++s<A+64;)*s=s[-1]+257;
to
char*A="_.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
for a cost of 5 additional characters.

Python (181 characters including newlines)
def v(E):
 import string as t;a=t.ascii_letters;e=a+"1234567890_.";t=e,e,"@",e,".",a,a,a,a,a,"",a
 for c in E:
  if c in t[0]:t=t[2:]
  elif not c in t[1]:return 0>1
 return""==t[0]
Basically just a state machine using obfuscatingly short variable names.

C (166 characters)
#define F(t,u)for(r=s;t=(*s-64?*s-46?isalpha(*s)?3:isdigit(*s)|*s==95?4:0:2:1);++s);if(s-r-1 u)return 0;
V(char*s){char*r;F(2<,<0)F(1=)F(3=,<0)F(2=)F(3=,<1)return 1;}
The single newline is required, and I've counted it as one character.

Python, 149 chars (after putting the whole for loop into one semicolon-separated line, which I haven't done here for "readability" purposes):
def v(s,t=0,o=1):
 for c in s:
  k=c=="@"
  p=c=="."
  A=c.isalnum()|p|(c=="_")
  L=c.isalpha()
  o&=[A,k|A,L,L|p,L,L,L][t]
  t+=[1,k,1,p,1,1,0][t]
 return(t>5)&o
Test cases, borrowed from strager's answer:
assert v("b@w.org")
assert v("r..t@x.tw")
assert v("j_r@x.mil")
assert not v("b@c@d.org")
assert not v("test@%.org")
assert not v("j_r@x.c.il")
assert not v("@w.org")
assert not v("test@org")
assert not v("s%p@m.org")
assert not v("foo@a%.com")
print "Yeah!"
Explanation: When iterating over the string, two variables keep getting updated.
t keeps the current state:
t = 0: We're at the beginning.
t = 1: We were at the beginning and have found at least one legal character (letter, number, underscore, period)
t = 2: We have found the "@"
t = 3: We have found at least one legal character (i.e. letter) after the "@"
t = 4: We have found the period in the domain name
t = 5: We have found one legal character (letter) after the period
t = 6: We have found at least two legal characters after the period
o as in "okay" starts as 1, i.e. true, and is set to 0 as soon as a character is found that is illegal in the current state.
Legal characters are:
In state 0: letter, number, underscore, period (change state to 1 in any case)
In state 1: letter, number, underscore, period, at-sign (change state to 2 if "@" is found)
In state 2: letter (change state to 3)
In state 3: letter, period (change state to 4 if period found)
In states 4 through 6: letter (increment state when in 4 or 5)
When we have gone all the way through the string, we return whether t==6 (t>5 is one char less) and o is 1.
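The state table above can also be written out long-hand. The following is an ungolfed sketch in Ruby, not the answer's code: the method name valid_email? and the sep parameter (the separator character, defaulting to "@") are hypothetical, and it returns early on an illegal character instead of tracking a separate "okay" flag. Like the challenge demands, it uses no regular expressions.

```ruby
# Ungolfed sketch of the seven-state machine; names are hypothetical.
def valid_email?(str, sep = "@")
  state = 0
  str.each_char do |c|
    letter = ("a".."z").cover?(c) || ("A".."Z").cover?(c)
    local  = letter || ("0".."9").cover?(c) || c == "." || c == "_"
    case state
    when 0                        # need a first local-part character
      return false unless local
      state = 1
    when 1                        # more local-part characters, or the separator
      if c == sep
        state = 2
      elsif !local
        return false
      end
    when 2                        # first letter of the domain
      return false unless letter
      state = 3
    when 3                        # more domain letters, or the period
      if c == "."
        state = 4
      elsif !letter
        return false
      end
    when 4, 5                     # first and second letter after the period
      return false unless letter
      state += 1
    else                          # state 6: only further letters are legal
      return false unless letter
    end
  end
  state == 6                      # valid only if at least two letters followed the period
end

puts valid_email?("j_r@x.mil")    # prints true
puts valid_email?("j_r@x.c.il")   # prints false
```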

Whatever version of C++ MSVC2008 supports.
Here's my humble submission. Now I know why they told me never to do the things I did in here:
#define N return 0
#define I(x) &&*x!='.'&&*x!='_'
bool p(char*a) {
    if(!isalnum(a[0])I(a))N;
    char*p=a,*b=0,*c=0;
    for(int d=0,e=0;*p;p++){
        if(*p=='@'){d++;b=p;}
        else if(*p=='.'){if(d){e++;c=p;}}
        else if(!isalnum(*p)I(p))N;
        if (d>1||e>1)N;
    }
    if(b>c||b+1>=c||c+2>=p)N;
    return 1;
}

Not the greatest solution no doubt, and pretty darn verbose, but it is valid.
Fixed (All test cases pass now)
static bool ValidateEmail(string email)
{
    var numbers = "1234567890";
    var uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    var lowercase = uppercase.ToLower();
    var arUppercase = uppercase.ToCharArray();
    var arLowercase = lowercase.ToCharArray();
    var arNumbers = numbers.ToCharArray();
    var atPieces = email.Split(new string[] { "@"}, StringSplitOptions.RemoveEmptyEntries);
    if (atPieces.Length != 2)
        return false;
    foreach (var c in atPieces[0])
    {
        if (!(arNumbers.Contains(c) || arLowercase.Contains(c) || arUppercase.Contains(c) || c == '.' || c == '_'))
            return false;
    }
    if(!atPieces[1].Contains("."))
        return false;
    var dotPieces = atPieces[1].Split('.');
    if (dotPieces.Length != 2)
        return false;
    foreach (var c in dotPieces[0])
    {
        if (!(arLowercase.Contains(c) || arUppercase.Contains(c)))
            return false;
    }
    var found = 0;
    foreach (var c in dotPieces[1])
    {
        if ((arLowercase.Contains(c) || arUppercase.Contains(c)))
            found++;
        else
            return false;
    }
    return found >= 2;
}

C89 character set agnostic (262 characters)
#include <stdio.h>
/* the 'const ' qualifiers should be removed when */
/* counting characters: I don't like warnings :) */
/* also the 'int ' should not be counted. */
/* it needs only 2 spaces (after the returns), should be only 2 lines */
/* that's a total of 262 characters (1 newline, 2 spaces) */
/* code golf starts here */
#include<string.h>
int v(const char*e){
const char*s="0123456789._abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
if(e=strpbrk(e,s))
  if(e=strchr(e+1,'@'))
    if(!strchr(e+1,'@'))
      if(e=strpbrk(e+1,s+12))
        if(e=strchr(e+1,'.'))
          if(!strchr(e+1,'.'))
            if(strlen(e+1)>1)
              return 1;
return 0;
}
/* code golf ends here */
int main(void) {
    const char *t;
    t = "b@w.org"; printf("%s ==> %d\n", t, v(t));
    t = "r..t@x.tw"; printf("%s ==> %d\n", t, v(t));
    t = "j_r@x.mil"; printf("%s ==> %d\n", t, v(t));
    t = "b@c@d.org"; printf("%s ==> %d\n", t, v(t));
    t = "test@%.org"; printf("%s ==> %d\n", t, v(t));
    t = "j_r@x.c.il"; printf("%s ==> %d\n", t, v(t));
    t = "@w.org"; printf("%s ==> %d\n", t, v(t));
    t = "test@org"; printf("%s ==> %d\n", t, v(t));
    t = "s%p@m.org"; printf("%s ==> %d\n", t, v(t));
    t = "foo@a%.com"; printf("%s ==> %d\n", t, v(t));
    return 0;
}
Version 2
Still C89 character set agnostic, bugs hopefully corrected (303 chars; 284 without the #include)
#include<string.h>
#define Y strchr
#define X{while(Y
v(char*e){char*s="0123456789_.abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
if(*e!='@')X(s,*e))e++;if(*e++=='@'&&!Y(e,'@')&&Y(e+1,'.'))X(s+12,*e))e++;if(*e++=='.'
&&!Y(e,'.')&&strlen(e)>1){while(*e&&Y(s+12,*e++));if(!*e)return 1;}}}return 0;}
That #define X is absolutely disgusting!
Test as for my first (buggy) version.

VBA/VB6 - 484 chars
Explicit off
usage: VE("b@w.org")
Function V(S, C)
    V = True
    For I = 1 To Len(S)
        If InStr(C, Mid(S, I, 1)) = 0 Then
            V = False: Exit For
        End If
    Next
End Function
Function VE(E)
    VE = False
    C1 = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    C2 = "0123456789._"
    P = Split(E, "@")
    If UBound(P) <> 1 Then GoTo X
    If Len(P(0)) < 1 Or Not V(P(0), C1 & C2) Then GoTo X
    E = P(1): P = Split(E, ".")
    If UBound(P) <> 1 Then GoTo X
    If Len(P(0)) < 1 Or Not V(P(0), C1) Or Len(P(1)) < 2 Or Not V(P(1), C1) Then GoTo X
    VE = True
X:
End Function

Java: 257 chars (not including the 3 end of lines for readability ;-)).
boolean q(char[]s){int a=0,b=0,c=0,d=0,e=0,f=0,g,y=-99;for(int i:s)
d=(g="@._0123456789QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm".indexOf(i))<0?
y:g<1&&++e>0&(b<1|++a>1)?y:g==1&e>0&(c<1||f++>0)?y:++b>0&g>12?f>0?d+1:f<1&e>0&&++c>0?
d:d:d;return d>1;}
Passes all the tests (my older version was incorrect).

Erlang 266 chars:
-module(cg_email).
-export([test/0]).
%%% golf code begin %%%
-define(E,when X>=$a,X=<$z;X>=$A,X=<$Z).
-define(I(Y,Z),Y([X|L])?E->Z(L);Y(_)->false).
-define(L(Y,Z),Y([X|L])?E;X>=$0,X=<$9;X=:=$.;X=:=$_->Z(L);Y(_)->false).
?L(e,m).
m([$@|L])->a(L);?L(m,m).
?I(a,i).
i([$.|L])->l(L);?I(i,i).
?I(l,c).
?I(c,g).
g([])->true;?I(g,g).
%%% golf code end %%%
test() ->
    true = e("b@w.org"),
    false = e("b@c@d.org"),
    false = e("test@%.org"),
    false = e("j_r@x.c.il"),
    true = e("r..t@x.tw"),
    false = e("test@org"),
    false = e("s%p@m.org"),
    true = e("j_r@x.mil"),
    false = e("foo@a%.com"),
    ok.

Ruby, 225 chars.
This is my first Ruby program, so it's probably not very Ruby-like :-)
def v z;r=!a=b=c=d=e=f=0;z.chars{|x|case x when'@';r||=b<1||!e;e=!1 when'.'
e ?b+=1:(a+=1;f=e);r||=a>1||(c<1&&!e)when'0'..'9';b+=1;r|=!e when'A'..'Z','a'..'z'
e ?b+=1:f ?c+=1:d+=1;else r=1 if x!='_'||!e|!b+=1;end};!r&&d>1 end

'Using no regex':
PHP 47 Chars.
<?=filter_var($argv[1],FILTER_VALIDATE_EMAIL);

Haskell (GHC 6.8.2), 165 161 144C Characters
Using pattern matching, elem, span and all:
a=['A'..'Z']++['a'..'z']
e=f.span(`elem`"._0123456789"++a)
f(_:_,'@':d)=g$span(`elem`a)d
f _=False
g(_:_,'.':t@(_:_:_))=all(`elem`a)t
g _=False
The above was tested with the following code:
main :: IO ()
main = print $ and [
    e "b@w.org",
    e "r..t@x.tw",
    e "j_r@x.mil",
    not $ e "b@c@d.org",
    not $ e "test@%.org",
    not $ e "j_r@x.c.il",
    not $ e "@w.org",
    not $ e "test@org",
    not $ e "s%p@m.org",
    not $ e "foo@a%.com"
    ]
