In Python, I can use the \N{...} escape to print Unicode characters by name.
print('\N{White Smiling Face}')
Will print ☺
Is there a way to do the same thing in Go? I couldn't find anything in unicode or x/text.
I’m printing a parameter returned from a query that’s a string of letters and underscores.
The label prints just the letters without the underscores, and I’m not sure how to fix it.
^FD<String>^FS
^FH^FD<String>^FS
Thank you very much.
(Removing the ^FH only prints up to the first underscore.)
The ^FH command without a parameter defaults to underscore as the hexadecimal escape character. Either remove the ^FH or specify a different escape character, such as backslash, using ^FH\^FD<String>^FS.
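A third option, if the field has to keep ^FH with the default indicator, is to hex-escape the underscores themselves: under ^FH, the indicator followed by two hex digits stands for that byte, so _5F yields a literal underscore. The sketch below is only an illustration; the helper name zpl_field is hypothetical, it handles the hex indicator only, and it does not escape other ZPL control characters such as ^ or ~.

```ruby
# Hypothetical helper: builds a ZPL field whose data survives ^FH processing.
# Every occurrence of the hex indicator in the text is replaced by the
# indicator followed by its two-digit hex code (e.g. "_" becomes "_5F").
def zpl_field(text, hex_indicator = "_")
  escaped = text.gsub(hex_indicator) do |ch|
    format("%s%02X", hex_indicator, ch.ord)
  end
  # The indicator is written explicitly after ^FH so the field is unambiguous.
  "^FH#{hex_indicator}^FD#{escaped}^FS"
end

puts zpl_field("PART_NO_42")        # underscores escaped as _5F
puts zpl_field("PART_NO_42", "\\")  # backslash as indicator, underscores untouched
```

With backslash as the indicator the underscores pass through unchanged, which matches the ^FH\ suggestion above.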
Here is the Unicode characters table for the Tibetan language,
https://en.m.wikipedia.org/wiki/Tibetan_(Unicode_block)
How do I use the codes in that chart in a fmt.Printf(mycode) statement in order to print, say, the Tibetan letter ཏ, which is located at row U+0F4x and column F of that Unicode chart?
Do I have to write:
fmt.Printf("U+0F4xF")
or something like that, or do I have to drop the "U" or the "U+"?
To print ཏ (U+0F4F TIBETAN LETTER TA), or any other Unicode character, you can put the character directly into your string literal, use a \u0F4F escape, or use the corresponding rune (Unicode code point):
fmt.Printf("Direct: ཏ\n")
fmt.Printf("Escape: \u0F4F\n")
fmt.Printf("Rune: %c\n", rune(0x0F4F))
The Go blog has some details...
I am using an Arduino with the OPEN-SMART Touch Screen Expansion Shield, which uses the Adafruit_GFX library. I need to print characters from across the Unicode range encoded as UTF-8, for example letters with diacritics, Greek letters, and so on. If I try to print these characters with the default font, it prints nonsense. What should I do?
I need to match emojis in a string in Ruby using a regex. I have tried several unicode sequences and none seem to quite do the job. I am also not sure where the start and end range for emojis would be.
This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.
Example usage:
regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji 😍😍😱😱👿👿🐔🌚 and other Unicode characters 比如中文."
str.gsub regex, ''
# => "I am a string with emoji  and other Unicode characters 比如中文."
Other Unicode characters, such as Asian characters, are preserved.
EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments from How do I remove emoji from string for details.
Emojis don't exist in one single range. They are scattered about. This is a collection of codes, and ranges where possible, that will match emojis. Tested in ruby 2.0.0p451:
str = "😣"
str.scan(/[\u{00A9}\u{00AE}\u{203C}\u{2049}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{27BF}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F31F}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}\u{1F68C}-\u{1F6C5}]/)
You can use the emoji_data gem to canonically match emoji in a string via its .scan method: https://github.com/mroth/emoji_data.rb
(disclaimer: I am the author)
Some of the more recent emoji are constructed from multiple emoji-related code points, for example by using the invisible "Zero-width joiner" (U+200D) code point to build so-called Emoji ZWJ sequences. You can use my unicode-emoji gem, which comes with a regex built from the latest emoji data published by the Unicode Consortium.
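To see why a per-code-point character class falls short for such sequences, here is a small sketch. It uses one family emoji (man, ZWJ, woman, ZWJ, girl) and a deliberately narrow range, U+1F300 to U+1F5FF, chosen only for illustration. The constant and method names are made up for this example, and the pattern is not a complete emoji matcher: skin-tone modifiers, variation selectors, and flags are not handled.

```ruby
# One visible glyph built from three pictographs joined by U+200D.
FAMILY = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"

# Matches a single code point in an illustrative pictograph range.
PICTOGRAPH = /[\u{1F300}-\u{1F5FF}]/

# Matches a pictograph followed by any number of ZWJ + pictograph pairs.
ZWJ_SEQUENCE = /[\u{1F300}-\u{1F5FF}](?:\u{200D}[\u{1F300}-\u{1F5FF}])*/

# Formats each code point of a string as U+XXXX for inspection.
def codepoint_labels(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

p codepoint_labels(FAMILY)           # five code points, two of them U+200D
p FAMILY.scan(PICTOGRAPH).length     # the class splits the glyph into pieces
p FAMILY.scan(ZWJ_SEQUENCE).length   # the sequence pattern keeps it whole
```

Removing matches of the per-code-point class from such a string would leave stray U+200D characters behind, which is one reason a data-driven regex such as the one in the gem is preferable.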
I use the code below:
puts "matched" if "中国" =~ /\w+/
It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z, or _. Why does it match?
Could somebody give me some clues?
I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:
\w: Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So it's not the case that \w necessarily just means [a-zA-Z_0-9]: it (and other operators) behave differently on Unicode strings than on ASCII ones.
This still makes it different from . though, as \w wouldn't match punctuation characters (sort of; see the \p{Lo} list below), spaces, newlines, and various other non-word symbols.
As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:
\p{Ll} Lowercase Unicode letter
\p{Lu} Uppercase Unicode letter
\p{Lt} Titlecase Unicode letter
\p{Lo} Other Unicode letter
\p{Nd} Decimal, number
\p{Pc} "Punctuation, connector"
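Ruby's regex engine accepts the same \p{...} general-category names, so the list above can be explored directly. The following is a rough sketch; the constant WORD_CATEGORIES and the method word_category are names invented for this example, and the sample characters are arbitrary picks from each category.

```ruby
# The six general categories that make up the Unicode definition of \w
# quoted above, in the order listed.
WORD_CATEGORIES = %w[Ll Lu Lt Lo Nd Pc].freeze

# Returns the first listed category that the single character belongs to,
# or nil if it is in none of them (e.g. spaces and most punctuation).
def word_category(char)
  WORD_CATEGORIES.find do |cat|
    char.match?(Regexp.new("\\A\\p{#{cat}}\\z"))
  end
end

# "\u01C5" is a titlecase digraph; "\u4E2D" is the first character of "中国".
["a", "A", "\u01C5", "\u4E2D", "5", "_", "!", " "].each do |ch|
  puts "#{ch.inspect}: #{word_category(ch).inspect}"
end
```

Chinese characters fall under \p{Lo}, which is why a Unicode-aware \w accepts them.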
Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:
[\w] word character
Not Unicode:
* alphanumeric, "_" and multibyte char.
Unicode:
* General_Category -- (Letter|Mark|Number|Connector_Punctuation)
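In 1.9+, every Ruby string carries its encoding, and the regex engine takes it into account when matching. The exact behavior of \w is version-dependent, though: current Ruby releases treat \w as ASCII-only ([a-zA-Z0-9_]) by default, and Unicode-aware matching is available through \p{Word} or the POSIX bracket [[:word:]].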
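Because the behavior of \w has differed between Ruby versions, it is worth checking on the interpreter you actually run. The sketch below compares four patterns against the string from the question; the names SAMPLE and word_matchers are invented for this example. It assumes a Ruby 2.4 or later interpreter (for String#match?), where \w is expected to be ASCII-only while \p{Word} and [[:word:]] follow the Unicode categories.

```ruby
# The two Chinese characters from the question, written as escapes so the
# result does not depend on the source file's encoding.
SAMPLE = "\u4E2D\u56FD"

# Reports which "word character" patterns match anywhere in the string.
def word_matchers(str)
  {
    backslash_w:  str.match?(/\w+/),           # ASCII-only on Ruby 2.x and 3.x
    ascii_class:  str.match?(/[a-zA-Z0-9_]+/), # explicit ASCII class
    unicode_prop: str.match?(/\p{Word}+/),     # Letter, Mark, Number, Connector_Punctuation
    posix_word:   str.match?(/[[:word:]]+/)    # Unicode-aware POSIX bracket
  }
end

p word_matchers(SAMPLE)
p word_matchers("abc_123")
```

If the first entry prints true for SAMPLE, the interpreter is applying the older multibyte or Unicode definition described in the Oniguruma notes above.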