Note: this question could look odd on systems not supporting the included emoji.
This is a follow-up question to How do I remove emoji from string.
I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.
The obvious Unicode blocks cover most, but not all of these emoji:
U+1F300..U+1F5FF Miscellaneous Symbols And Pictographs
U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols
Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point (slightly updated):
people = '…'
nature = '…'
objects = '…'
places = '…'
symbols = '…'
(The emoji characters themselves are mis-encoded in this copy of the question; the five category strings above originally listed every Apple Color Emoji symbol from the Wikipedia page.)
emoji = people + nature + objects + places + symbols # all emoji combined
Most characters have a single code point and converting these would be easy:
😀 U+1F600 (Grinning Face)
But some characters are "encoded using two Unicode values":
☺️ U+263A U+FE0F (White Smiling Face, Variation Selector 16)
🇯🇵 U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)
⬛️ U+2B1B U+FE0F (Black Large Square / Variation Selector 16)
And some even have 3 codepoints:
#️⃣ U+0023 U+FE0F U+20E3 (Number Sign / Variation Selector 16 / Combining Enclosing Keycap)
(Variation Selector 16 means "emoji style")
How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?
The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.
(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)
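For the splitting step, note that recent Rubies (2.4 and later) can iterate grapheme clusters with the \X regex escape, which keeps such combined sequences in one piece; a small sketch:

```ruby
# \X matches one extended grapheme cluster, so variation sequences,
# flags (regional indicator pairs) and keycaps each come out whole.
s = "a\u{263A FE0F}\u{1F1EF 1F1F5}\u{23 FE0F 20E3}"  # a ☺️ 🇯🇵 #️⃣
clusters = s.scan(/\X/)
clusters.length  # => 4
clusters.map { |c| c.codepoints.map { |cp| "U+%04X" % cp } }
# => [["U+0061"], ["U+263A", "U+FE0F"], ["U+1F1EF", "U+1F1F5"], ["U+0023", "U+FE0F", "U+20E3"]]
```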
The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.
By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it's trivial to turn it into a regular expression.
Here's a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js And here's the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
Examples can be found here: https://stackoverflow.com/a/29115920/1911674
EDIT: I updated the regex to exclude ASCII numbers and symbols. See the comments on How do I remove emoji from string for details.
Related
I use DT_WORDBREAK flag when I call DrawTextEx. About this flag MSDN says:
Lines are automatically broken between words if a word extends past
the edge of the rectangle specified by the lprc parameter. A carriage
return-line feed sequence also breaks the line.
But I cannot find an "official" list of the symbols that are used as word-break characters. Does such a list exist?
If you get the TEXTMETRICs for the font you're using, it corresponds to the tmBreakChar field.
For any Latin font, this is almost certainly just the plain old space character (Unicode U+0020 SPACE or ASCII 32).
I don't think DrawTextEx does anything fancier. You'd have to use a more advanced API to get more sophisticated behavior such as breaking after hyphens, soft-hyphens, other kinds of spaces, etc.
I tagged character-encoding and text because I know that typing 'and' == 'and' into the Rails console (or most any other programming language) returns true. However, when one of my users pastes his text into my website, I can't spell-check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding?).
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'аnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'аnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn't really an encoding problem; in the first case the strings compare as false simply because they are different.
The first character of the first string isn't a "normal" a; it is actually U+0430 CYRILLIC SMALL LETTER A. The first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a "normal" Latin a, which is U+0061 LATIN SMALL LETTER A.
Here's the "normal" a: a, and this is the Cyrillic a: а; they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that there are many other characters with appearances similar to more familiar ones. If you choose this route you would be better off looking for a library/gem that does it for you, and you might find you're too strict about conversions.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby's regular expression script support, e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
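As a sketch of that last approach (the latin_only? helper name is mine, not from any library):

```ruby
# Accept only Latin-script characters plus script-neutral ones:
# digits, punctuation and whitespace all belong to \p{Common}.
def latin_only?(str)
  str.each_char.all? { |c| c.match?(/[\p{Latin}\p{Common}]/) }
end

latin_only?("and 123!")  # => true
latin_only?("\u0430nd")  # => false (starts with CYRILLIC SMALL LETTER A)
```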
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
Really, there is no problem at all! The two strings are different. With the font you are using, a cyrillic a looks the same as a latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that cyrillic a come from?
Maybe it was meant to be a cyrillic a and really should be treated not-equal to a latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why browsers display the Punycode form of the domain in addition to the Unicode domain name.
I think it is an encoding issue; you can check like this:
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'аnd'.each_byte {|b| puts b} # copied 'аnd'
208
176
110
100
=> "аnd"
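Looking at code points instead of raw bytes makes the difference even more obvious; a small sketch along the same lines (the show helper name is mine):

```ruby
# codepoints exposes the Cyrillic U+0430 directly, without UTF-8 byte math.
def show(str)
  str.codepoints.map { |cp| "U+%04X" % cp }
end

show("and")       # => ["U+0061", "U+006E", "U+0064"]
show("\u0430nd")  # => ["U+0430", "U+006E", "U+0064"]
```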
In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
(several weather characters; they are mis-encoded in this copy)
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed, ☔︎, consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font that can distinguish the two styles, the following might be the emoji version: ☔️. Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[...\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\224\[\357\270\216\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '
I am trying to come to terms with how a barcode is decoded and generated by a scanner.
A note from the client says that the generated bar code below contains extra characters:
Generated Code: |2389299920014}
Extra Characters: Apparently the first two and last three characters are not part of the bar code.
Question
Are the extra characters attached by the bar code reader (therefore dependent on the scanner) or are they an intrinsic part of the barcode?
Here is a sample image of a barcode:
http://imageshack.us/a/img824/1862/dm6x.jpg
Thanks
[SOLVED] My apologies. This was just another one of those cases of 'shooting your mouth off' without doing proper research.
Solution: the code is EAN-13. The prefix and suffix are probably scanner-dependent. The 13 digits in between break down as follows: the first 3 digits are the GS1 prefix, the next 9 digits are the company ID + item ID, and the last digit is the check sum.
It's hard to answer without understanding what format you are trying to encode, what the intended contents are, and what the purported contents are.
Some formats add extra information as part of the encoding process, but it does not become part of the content. When correctly encoded and decoded, the output should match the input exactly.
A barcode encodes exactly what it encodes; there is no data that is somehow part of the barcode yet not encoded in it.
EAN-13 has no scanner-dependent considerations, no. The encoding and decoding of a given number is the same everywhere. EAN-13 encodes 13 digits, so I am not sure what the 13 digits "in between" mean.
You mention GS1, which is something else. A family of barcodes in fact. You'd have to say what specifically you are using. The GS1 encodings are likewise not ambiguous or scanner-dependent. You know what you want to encode, you encode it exactly, it's read exactly.
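As a side note on the check digit: for an EAN-13, the 13th digit is computed from the first 12 with alternating weights 1 and 3. A quick sketch (the helper name is mine), which checks out against the scanned 2389299920014:

```ruby
# EAN-13 check digit: weight the first 12 digits 1,3,1,3,... from the
# left, sum them, then take (10 - sum mod 10) mod 10.
def ean13_check_digit(first12)
  sum = first12.chars.each_with_index.sum { |d, i| d.to_i * (i.even? ? 1 : 3) }
  (10 - sum % 10) % 10
end

ean13_check_digit("238929992001")  # => 4, the last digit of 2389299920014
```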
I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
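Without Rails, a minimal hand-rolled sketch works for BMP characters (the u_escape name is mine; code points above U+FFFF would need UTF-16 surrogate pairs to be strict JSON, which this deliberately skips):

```ruby
# Escape every non-ASCII BMP character as \uXXXX, leaving ASCII alone.
def u_escape(str)
  str.each_char.map { |c| c.ord < 0x80 ? c : format("\\u%04x", c.ord) }.join
end

u_escape("µ")    # => "\\u00b5"
u_escape("abc")  # => "abc"
```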
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, suppose a file called unicode.txt contains some UTF-8 encoded data, say the currency symbols €£¥$ plus a trailing newline. Running the following code (either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table, or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're running the above on a larger input file, it may prove unwieldy: it can be hard to even find the escape sequences in files of mostly ASCII text, or to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
̆ encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋, U+1F64B) with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў (though it looks the same) is formed by combining у (U+0443) with the modifier ̆ (U+0306, which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
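The pre-combined and decomposed forms can be made to compare equal by normalizing; a sketch using String#unicode_normalize (built into Ruby since 2.2):

```ruby
precomposed = "\u045E"        # the single-code-point form
decomposed  = "\u0443\u0306"  # base letter followed by combining breve
precomposed == decomposed                          # => false
precomposed == decomposed.unicode_normalize(:nfc)  # => true
precomposed.unicode_normalize(:nfd) == decomposed  # => true
```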
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24}, requiring only 2 digits for three of the characters.
You can use Unicode characters directly in your source if you add # encoding: UTF-8 to the top of your file (in Ruby 2.0 and later, UTF-8 is already the default source encoding). Then you can freely use ä, ǹ, ú, and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
The gem uses the table at https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.