Regular expression matching emoji in Mac OS X / iOS - ruby

Note: this question could look odd on systems not supporting the included emoji.
This is a follow-up question to How do I remove emoji from string.
I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.
The obvious Unicode blocks cover most, but not all of these emoji:
U+1F300..U+1F5FF Miscellaneous Symbols And Pictographs
U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols
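In Ruby (1.9+), these four contiguous blocks translate directly into a single character class. This is only a sketch of the block-based approach; it will not match the multi-code-point sequences (variation selectors, flags, keycaps) discussed below:

```ruby
# Character class covering the four Unicode blocks listed above.
EMOJI_BLOCKS = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{1F650}-\u{1F67F}\u{1F680}-\u{1F6FF}]/

EMOJI_BLOCKS.match?("\u{1F600}")   # grinning face => true
EMOJI_BLOCKS.match?("plain text")  # => false
```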
Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point: (slightly updated)
people = '๐Ÿ˜„๐Ÿ˜ƒ๐Ÿ˜€๐Ÿ˜Šโ˜บ๏ธ๐Ÿ˜‰๐Ÿ˜๐Ÿ˜˜๐Ÿ˜š๐Ÿ˜—๐Ÿ˜™๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜ณ๐Ÿ˜๐Ÿ˜”๐Ÿ˜Œ๐Ÿ˜’๐Ÿ˜ž๐Ÿ˜ฃ๐Ÿ˜ข๐Ÿ˜‚๐Ÿ˜ญ๐Ÿ˜ช๐Ÿ˜ฅ๐Ÿ˜ฐ๐Ÿ˜…๐Ÿ˜“๐Ÿ˜ฉ๐Ÿ˜ซ๐Ÿ˜จ๐Ÿ˜ฑ๐Ÿ˜ ๐Ÿ˜ก๐Ÿ˜ค๐Ÿ˜–๐Ÿ˜†๐Ÿ˜‹๐Ÿ˜ท๐Ÿ˜Ž๐Ÿ˜ด๐Ÿ˜ต๐Ÿ˜ฒ๐Ÿ˜Ÿ๐Ÿ˜ฆ๐Ÿ˜ง๐Ÿ˜ˆ๐Ÿ‘ฟ๐Ÿ˜ฎ๐Ÿ˜ฌ๐Ÿ˜๐Ÿ˜•๐Ÿ˜ฏ๐Ÿ˜ถ๐Ÿ˜‡๐Ÿ˜๐Ÿ˜‘๐Ÿ‘ฒ๐Ÿ‘ณ๐Ÿ‘ฎ๐Ÿ‘ท๐Ÿ’‚๐Ÿ‘ถ๐Ÿ‘ฆ๐Ÿ‘ง๐Ÿ‘จ๐Ÿ‘ฉ๐Ÿ‘ด๐Ÿ‘ต๐Ÿ‘ฑ๐Ÿ‘ผ๐Ÿ‘ธ๐Ÿ˜บ๐Ÿ˜ธ๐Ÿ˜ป๐Ÿ˜ฝ๐Ÿ˜ผ๐Ÿ™€๐Ÿ˜ฟ๐Ÿ˜น๐Ÿ˜พ๐Ÿ‘น๐Ÿ‘บ๐Ÿ™ˆ๐Ÿ™‰๐Ÿ™Š๐Ÿ’€๐Ÿ‘ฝ๐Ÿ’ฉ๐Ÿ”ฅโœจ๐ŸŒŸ๐Ÿ’ซ๐Ÿ’ฅ๐Ÿ’ข๐Ÿ’ฆ๐Ÿ’ง๐Ÿ’ค๐Ÿ’จ๐Ÿ‘‚๐Ÿ‘€๐Ÿ‘ƒ๐Ÿ‘…๐Ÿ‘„๐Ÿ‘๐Ÿ‘Ž๐Ÿ‘Œ๐Ÿ‘ŠโœŠโœŒ๐Ÿ‘‹โœ‹๐Ÿ‘๐Ÿ‘†๐Ÿ‘‡๐Ÿ‘‰๐Ÿ‘ˆ๐Ÿ™Œ๐Ÿ™โ˜๐Ÿ‘๐Ÿ’ช๐Ÿšถ๐Ÿƒ๐Ÿ’ƒ๐Ÿ‘ซ๐Ÿ‘ช๐Ÿ‘ฌ๐Ÿ‘ญ๐Ÿ’๐Ÿ’‘๐Ÿ‘ฏ๐Ÿ™†๐Ÿ™…๐Ÿ’๐Ÿ™‹๐Ÿ’†๐Ÿ’‡๐Ÿ’…๐Ÿ‘ฐ๐Ÿ™Ž๐Ÿ™๐Ÿ™‡๐ŸŽฉ๐Ÿ‘‘๐Ÿ‘’๐Ÿ‘Ÿ๐Ÿ‘ž๐Ÿ‘ก๐Ÿ‘ ๐Ÿ‘ข๐Ÿ‘•๐Ÿ‘”๐Ÿ‘š๐Ÿ‘—๐ŸŽฝ๐Ÿ‘–๐Ÿ‘˜๐Ÿ‘™๐Ÿ’ผ๐Ÿ‘œ๐Ÿ‘๐Ÿ‘›๐Ÿ‘“๐ŸŽ€๐ŸŒ‚๐Ÿ’„๐Ÿ’›๐Ÿ’™๐Ÿ’œ๐Ÿ’šโค๐Ÿ’”๐Ÿ’—๐Ÿ’“๐Ÿ’•๐Ÿ’–๐Ÿ’ž๐Ÿ’˜๐Ÿ’Œ๐Ÿ’‹๐Ÿ’๐Ÿ’Ž๐Ÿ‘ค๐Ÿ‘ฅ๐Ÿ’ฌ๐Ÿ‘ฃ๐Ÿ’ญ'
nature = '๐Ÿถ๐Ÿบ๐Ÿฑ๐Ÿญ๐Ÿน๐Ÿฐ๐Ÿธ๐Ÿฏ๐Ÿจ๐Ÿป๐Ÿท๐Ÿฝ๐Ÿฎ๐Ÿ—๐Ÿต๐Ÿ’๐Ÿด๐Ÿ‘๐Ÿ˜๐Ÿผ๐Ÿง๐Ÿฆ๐Ÿค๐Ÿฅ๐Ÿฃ๐Ÿ”๐Ÿ๐Ÿข๐Ÿ›๐Ÿ๐Ÿœ๐Ÿž๐ŸŒ๐Ÿ™๐Ÿš๐Ÿ ๐ŸŸ๐Ÿฌ๐Ÿณ๐Ÿ‹๐Ÿ„๐Ÿ๐Ÿ€๐Ÿƒ๐Ÿ…๐Ÿ‡๐Ÿ‰๐ŸŽ๐Ÿ๐Ÿ“๐Ÿ•๐Ÿ–๐Ÿ๐Ÿ‚๐Ÿฒ๐Ÿก๐ŸŠ๐Ÿซ๐Ÿช๐Ÿ†๐Ÿˆ๐Ÿฉ๐Ÿพ๐Ÿ’๐ŸŒธ๐ŸŒท๐Ÿ€๐ŸŒน๐ŸŒป๐ŸŒบ๐Ÿ๐Ÿƒ๐Ÿ‚๐ŸŒฟ๐ŸŒพ๐Ÿ„๐ŸŒต๐ŸŒด๐ŸŒฒ๐ŸŒณ๐ŸŒฐ๐ŸŒฑ๐ŸŒผ๐ŸŒ๐ŸŒž๐ŸŒ๐ŸŒš๐ŸŒ‘๐ŸŒ’๐ŸŒ“๐ŸŒ”๐ŸŒ•๐ŸŒ–๐ŸŒ—๐ŸŒ˜๐ŸŒœ๐ŸŒ›๐ŸŒ™๐ŸŒ๐ŸŒŽ๐ŸŒ๐ŸŒ‹๐ŸŒŒ๐ŸŒ โญโ˜€โ›…โ˜โšกโ˜”โ„โ›„๐ŸŒ€๐ŸŒ๐ŸŒˆ๐ŸŒŠ'
objects = '๐ŸŽ๐Ÿ’๐ŸŽŽ๐ŸŽ’๐ŸŽ“๐ŸŽ๐ŸŽ†๐ŸŽ‡๐ŸŽ๐ŸŽ‘๐ŸŽƒ๐Ÿ‘ป๐ŸŽ…๐ŸŽ„๐ŸŽ๐ŸŽ‹๐ŸŽ‰๐ŸŽŠ๐ŸŽˆ๐ŸŽŒ๐Ÿ”ฎ๐ŸŽฅ๐Ÿ“ท๐Ÿ“น๐Ÿ“ผ๐Ÿ’ฟ๐Ÿ“€๐Ÿ’ฝ๐Ÿ’พ๐Ÿ’ป๐Ÿ“ฑโ˜Ž๐Ÿ“ž๐Ÿ“Ÿ๐Ÿ“ ๐Ÿ“ก๐Ÿ“บ๐Ÿ“ป๐Ÿ”Š๐Ÿ”‰๐Ÿ”ˆ๐Ÿ”‡๐Ÿ””๐Ÿ”•๐Ÿ“ข๐Ÿ“ฃโณโŒ›โฐโŒš๐Ÿ”“๐Ÿ”’๐Ÿ”๐Ÿ”๐Ÿ”‘๐Ÿ”Ž๐Ÿ’ก๐Ÿ”ฆ๐Ÿ”†๐Ÿ”…๐Ÿ”Œ๐Ÿ”‹๐Ÿ”๐Ÿ›๐Ÿ›€๐Ÿšฟ๐Ÿšฝ๐Ÿ”ง๐Ÿ”ฉ๐Ÿ”จ๐Ÿšช๐Ÿšฌ๐Ÿ’ฃ๐Ÿ”ซ๐Ÿ”ช๐Ÿ’Š๐Ÿ’‰๐Ÿ’ฐ๐Ÿ’ด๐Ÿ’ต๐Ÿ’ท๐Ÿ’ถ๐Ÿ’ณ๐Ÿ’ธ๐Ÿ“ฒ๐Ÿ“ง๐Ÿ“ฅ๐Ÿ“คโœ‰๐Ÿ“ฉ๐Ÿ“จ๐Ÿ“ฏ๐Ÿ“ซ๐Ÿ“ช๐Ÿ“ฌ๐Ÿ“ญ๐Ÿ“ฎ๐Ÿ“ฆ๐Ÿ“๐Ÿ“„๐Ÿ“ƒ๐Ÿ“‘๐Ÿ“Š๐Ÿ“ˆ๐Ÿ“‰๐Ÿ“œ๐Ÿ“‹๐Ÿ“…๐Ÿ“†๐Ÿ“‡๐Ÿ“๐Ÿ“‚โœ‚๐Ÿ“Œ๐Ÿ“Žโœ’โœ๐Ÿ“๐Ÿ“๐Ÿ“•๐Ÿ“—๐Ÿ“˜๐Ÿ“™๐Ÿ““๐Ÿ“”๐Ÿ“’๐Ÿ“š๐Ÿ“–๐Ÿ”–๐Ÿ“›๐Ÿ”ฌ๐Ÿ”ญ๐Ÿ“ฐ๐ŸŽจ๐ŸŽฌ๐ŸŽค๐ŸŽง๐ŸŽผ๐ŸŽต๐ŸŽถ๐ŸŽน๐ŸŽป๐ŸŽบ๐ŸŽท๐ŸŽธ๐Ÿ‘พ๐ŸŽฎ๐Ÿƒ๐ŸŽด๐Ÿ€„๐ŸŽฒ๐ŸŽฏ๐Ÿˆ๐Ÿ€โšฝโšพ๐ŸŽพ๐ŸŽฑ๐Ÿ‰๐ŸŽณโ›ณ๐Ÿšต๐Ÿšด๐Ÿ๐Ÿ‡๐Ÿ†๐ŸŽฟ๐Ÿ‚๐ŸŠ๐Ÿ„๐ŸŽฃโ˜•๐Ÿต๐Ÿถ๐Ÿผ๐Ÿบ๐Ÿป๐Ÿธ๐Ÿน๐Ÿท๐Ÿด๐Ÿ•๐Ÿ”๐ŸŸ๐Ÿ—๐Ÿ–๐Ÿ๐Ÿ›๐Ÿค๐Ÿฑ๐Ÿฃ๐Ÿฅ๐Ÿ™๐Ÿ˜๐Ÿš๐Ÿœ๐Ÿฒ๐Ÿข๐Ÿก๐Ÿณ๐Ÿž๐Ÿฉ๐Ÿฎ๐Ÿฆ๐Ÿจ๐Ÿง๐ŸŽ‚๐Ÿฐ๐Ÿช๐Ÿซ๐Ÿฌ๐Ÿญ๐Ÿฏ๐ŸŽ๐Ÿ๐ŸŠ๐Ÿ‹๐Ÿ’๐Ÿ‡๐Ÿ‰๐Ÿ“๐Ÿ‘๐Ÿˆ๐ŸŒ๐Ÿ๐Ÿ๐Ÿ ๐Ÿ†๐Ÿ…๐ŸŒฝ'
places = '๐Ÿ ๐Ÿก๐Ÿซ๐Ÿข๐Ÿฃ๐Ÿฅ๐Ÿฆ๐Ÿช๐Ÿฉ๐Ÿจ๐Ÿ’’โ›ช๐Ÿฌ๐Ÿค๐ŸŒ‡๐ŸŒ†๐Ÿฏ๐Ÿฐโ›บ๐Ÿญ๐Ÿ—ผ๐Ÿ—พ๐Ÿ—ป๐ŸŒ„๐ŸŒ…๐ŸŒƒ๐Ÿ—ฝ๐ŸŒ‰๐ŸŽ ๐ŸŽกโ›ฒ๐ŸŽข๐Ÿšขโ›ต๐Ÿšค๐Ÿšฃโš“๐Ÿš€โœˆ๐Ÿ’บ๐Ÿš๐Ÿš‚๐ŸšŠ๐Ÿš‰๐Ÿšž๐Ÿš†๐Ÿš„๐Ÿš…๐Ÿšˆ๐Ÿš‡๐Ÿš๐Ÿš‹๐Ÿšƒ๐ŸšŽ๐ŸšŒ๐Ÿš๐Ÿš™๐Ÿš˜๐Ÿš—๐Ÿš•๐Ÿš–๐Ÿš›๐Ÿšš๐Ÿšจ๐Ÿš“๐Ÿš”๐Ÿš’๐Ÿš‘๐Ÿš๐Ÿšฒ๐Ÿšก๐ŸšŸ๐Ÿš ๐Ÿšœ๐Ÿ’ˆ๐Ÿš๐ŸŽซ๐Ÿšฆ๐Ÿšฅโš ๐Ÿšง๐Ÿ”ฐโ›ฝ๐Ÿฎ๐ŸŽฐโ™จ๐Ÿ—ฟ๐ŸŽช๐ŸŽญ๐Ÿ“๐Ÿšฉ๐Ÿ‡ฏ๐Ÿ‡ต๐Ÿ‡ฐ๐Ÿ‡ท๐Ÿ‡ฉ๐Ÿ‡ช๐Ÿ‡จ๐Ÿ‡ณ๐Ÿ‡บ๐Ÿ‡ธ๐Ÿ‡ซ๐Ÿ‡ท๐Ÿ‡ช๐Ÿ‡ธ๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ‡ท๐Ÿ‡บ๐Ÿ‡ฌ๐Ÿ‡ง'
symbols = '1๏ธโƒฃ2๏ธโƒฃ3๏ธโƒฃ4๏ธโƒฃ5๏ธโƒฃ6๏ธโƒฃ7๏ธโƒฃ8๏ธโƒฃ9๏ธโƒฃ0๏ธโƒฃ๐Ÿ”Ÿ๐Ÿ”ข#๏ธโƒฃ๐Ÿ”ฃโฌ†๏ธโฌ‡๏ธโฌ…๏ธโžก๏ธ๐Ÿ” ๐Ÿ”ก๐Ÿ”คโ†—๏ธโ†–๏ธโ†˜๏ธโ†™๏ธโ†”๏ธโ†•๏ธ๐Ÿ”„โ—€๏ธโ–ถ๏ธ๐Ÿ”ผ๐Ÿ”ฝโ†ฉ๏ธโ†ช๏ธโ„น๏ธโชโฉโซโฌโคต๏ธโคด๏ธ๐Ÿ†—๐Ÿ”€๐Ÿ”๐Ÿ”‚๐Ÿ†•๐Ÿ†™๐Ÿ†’๐Ÿ†“๐Ÿ†–๐Ÿ“ถ๐ŸŽฆ๐Ÿˆ๐Ÿˆฏ๐Ÿˆณ๐Ÿˆต๐Ÿˆด๐Ÿˆฒ๐Ÿ‰๐Ÿˆน๐Ÿˆบ๐Ÿˆถ๐Ÿˆš๐Ÿšป๐Ÿšน๐Ÿšบ๐Ÿšผ๐Ÿšพ๐Ÿšฐ๐Ÿšฎ๐Ÿ…ฟ๏ธโ™ฟ๏ธ๐Ÿšญ๐Ÿˆท๐Ÿˆธ๐Ÿˆ‚โ“‚๏ธ๐Ÿ›‚๐Ÿ›„๐Ÿ›…๐Ÿ›ƒ๐Ÿ‰‘ใŠ™๏ธใŠ—๏ธ๐Ÿ†‘๐Ÿ†˜๐Ÿ†”๐Ÿšซ๐Ÿ”ž๐Ÿ“ต๐Ÿšฏ๐Ÿšฑ๐Ÿšณ๐Ÿšท๐Ÿšธโ›”โœณ๏ธโ‡๏ธโŽโœ…โœด๏ธ๐Ÿ’Ÿ๐Ÿ†š๐Ÿ“ณ๐Ÿ“ด๐Ÿ…ฐ๐Ÿ…ฑ๐Ÿ†Ž๐Ÿ…พ๐Ÿ’ โžฟโ™ป๏ธโ™ˆ๏ธโ™‰๏ธโ™Š๏ธโ™‹๏ธโ™Œ๏ธโ™๏ธโ™Ž๏ธโ™๏ธโ™๏ธโ™‘๏ธโ™’๏ธโ™“๏ธโ›Ž๐Ÿ”ฏ๐Ÿง๐Ÿ’น๐Ÿ’ฒ๐Ÿ’ฑยฉ๏ธยฎ๏ธโ„ข๏ธโŒโ€ผ๏ธโ‰๏ธโ—โ“โ•โ”โญ•๐Ÿ”๐Ÿ”š๐Ÿ”™๐Ÿ”›๐Ÿ”œ๐Ÿ”ƒ๐Ÿ•›๐Ÿ•ง๐Ÿ•๐Ÿ•œ๐Ÿ•‘๐Ÿ•๐Ÿ•’๐Ÿ•ž๐Ÿ•“๐Ÿ•Ÿ๐Ÿ•”๐Ÿ• ๐Ÿ••๐Ÿ•–๐Ÿ•—๐Ÿ•˜๐Ÿ•™๐Ÿ•š๐Ÿ•ก๐Ÿ•ข๐Ÿ•ฃ๐Ÿ•ค๐Ÿ•ฅ๐Ÿ•ฆโœ–๏ธโž•โž–โž—โ™ โ™ฅโ™ฃโ™ฆ๐Ÿ’ฎ๐Ÿ’ฏโœ”โ˜‘๐Ÿ”˜๐Ÿ”—โžฐใ€ฐใ€ฝ๏ธ๐Ÿ”ฑโ—ผ๏ธโ—ป๏ธโ—พ๏ธโ—ฝ๏ธโ–ช๏ธโ–ซ๏ธ๐Ÿ”บ๐Ÿ”ฒ๐Ÿ”ณโšซ๏ธโšช๏ธ๐Ÿ”ด๐Ÿ”ต๐Ÿ”ปโฌœ๏ธโฌ›๏ธ๐Ÿ”ถ๐Ÿ”ท๐Ÿ”ธ๐Ÿ”น'
emoji = people + nature + objects + places + symbols # all emoji combined
Most characters have a single code point and converting these would be easy:
๐Ÿ˜€ U+1F600 (Grinning Face)
But some characters are "encoded using two Unicode values":
โ˜บ๏ธ U+263A U+FE0F (White Smiling Face, Variation Selector 16)
๐Ÿ‡ฏ๐Ÿ‡ต U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)
โฌ›๏ธ U+2B1B U+FE0F (Black Large Square / Variation Selector 16)
And some even have 3 codepoints:
๏ธโƒฃ U+0023 U+FE0F U+20E3 (Number Sign / Variation Selector 16 / Combining Enclosing Keycap)
(Variation Selector 16 means "emoji style")
How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?
The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.
(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)
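A sketch of the splitting step (Ruby 2.0+): the \X regex escape matches one extended grapheme cluster, so combined sequences like the flag and variation-selector characters stay in one piece, and String#unpack('U*') then yields the code points:

```ruby
# Grinning face, smiling face + VS16, and the two regional indicators
# that form the Japanese flag.
emoji = "\u{1F600}\u{263A FE0F}\u{1F1EF 1F1F5}"

# \X matches one extended grapheme cluster, so multi-code-point
# emoji are not split apart.
clusters = emoji.scan(/\X/)

# Code points of each cluster, formatted as U+XXXX:
clusters.map { |c| c.unpack('U*').map { |cp| format('U+%04X', cp) }.join(' ') }
# => ["U+1F600", "U+263A U+FE0F", "U+1F1EF U+1F1F5"]

# A literal alternation matching exactly these sequences:
pattern = Regexp.union(clusters)
pattern.match?("text with \u{263A FE0F} inside")  # => true
```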

The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.
By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it's trivial to turn it into a regular expression.
Here's a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js And here's the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
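The parsing step is straightforward in Ruby, too. A minimal sketch, assuming lines of the form "XXXX ; property" or "XXXX..YYYY ; property" (the sample data below is illustrative; the draft file's exact columns may differ, so feed it the real file):

```ruby
# Illustrative emoji-data.txt-style lines:
sample = <<~DATA
  231A..231B    ; Emoji                # WATCH..HOURGLASS
  1F600         ; Emoji                # GRINNING FACE
DATA

# Collect each code point or range as a Ruby Range of integers.
ranges = sample.each_line.map { |line|
  next unless line =~ /\A([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;/
  first = Regexp.last_match(1).to_i(16)
  last  = (Regexp.last_match(2) || Regexp.last_match(1)).to_i(16)
  first..last
}.compact

# Build a single character class from the ranges.
body = ranges.map { |r| format('\u{%X}-\u{%X}', r.first, r.last) }.join
emoji_re = Regexp.new("[#{body}]")

emoji_re.match?("\u{1F600}")  # => true
emoji_re.match?("a")          # => false
```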

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
Examples can be found here: https://stackoverflow.com/a/29115920/1911674
EDIT: I updated the regex to exclude ASCII numbers and symbols. See the comments on How do I remove emoji from string for details.

Related

DT_WORDBREAK: list of word break symbols

I use DT_WORDBREAK flag when I call DrawTextEx. About this flag MSDN says:
Lines are automatically broken between words if a word extends past
the edge of the rectangle specified by the lprc parameter. A carriage
return-line feed sequence also breaks the line.
But I cannot find an "official" list of the symbols that are used as word-break symbols. Does it exist?
If you get the TEXTMETRICs for the font you're using, it corresponds to the tmBreakChar field.
For any Latin font, this is almost certainly just the plain old space character (Unicode U+0020 SPACE or ASCII 32).
I don't think DrawTextEx does anything fancier. You'd have to use a more advanced API to get more sophisticated behavior such as breaking after hyphens, soft-hyphens, other kinds of spaces, etc.

Why is 'ะฐnd' == 'and' false?

I tagged character-encoding and text because I know that if you type 'and' == 'and' into the rails console, or almost any other programming language, you will get true. However, when one of my users pastes his text into my website, I can't spell-check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding?).
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'ะฐnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'ะฐnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn't really an encoding problem; in the first case the strings compare as false simply because they are different.
The first character of the first string isn't a "normal" a: it is actually U+0430 CYRILLIC SMALL LETTER A. The first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a "normal" Latin a, which is U+0061 LATIN SMALL LETTER A.
Here's the "normal" a: a, and this is the Cyrillic a: ะฐ; they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that there are many other characters that look similar to the more familiar ones. If you choose this route you would be better off looking for a library/gem that does it for you, and you might find you're being too strict about conversions.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby's regular expression script support, e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
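For example, a sketch of the script-based check (the \u0430 below is the Cyrillic a from the question; latin_only? is a hypothetical helper name):

```ruby
suspect = "\u0430nd"  # Cyrillic a + Latin "nd", as pasted by the user
plain   = "and"       # all Latin

suspect == plain                # => false
suspect.match?(/\p{Cyrillic}/)  # => true
plain.match?(/\p{Cyrillic}/)    # => false

# Or whitelist scripts: reject anything outside Latin plus the
# "Common" script (digits, spaces, punctuation).
def latin_only?(str)
  !str.match?(/[^\p{Latin}\p{Common}]/)
end

latin_only?(plain)    # => true
latin_only?(suspect)  # => false
```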
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
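Both breakdowns are easy to reproduce in Ruby: String#unpack('U*') yields code points rather than raw bytes.

```ruby
copied = "\u0430nd"  # the pasted string, Cyrillic a first
typed  = "and"

copied.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+0430", "U+006E", "U+0064"]
typed.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+0061", "U+006E", "U+0064"]
```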
Really, there is no problem at all! The two strings are different. With the font you are using, a cyrillic a looks the same as a latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that cyrillic a come from?
Maybe it was meant to be a cyrillic a and really should be treated not-equal to a latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why they always display the Punycode domain in addition to the Unicode domain name.
I think it is an encoding issue; you can see it like this:
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'ะฐnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "ะฐnd"

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]โ˜”๏ธŽ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
โ˜‚โ˜ƒโ˜ฝโ˜€๏ธŽโ˜”๏ธŽ
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed โ˜”๏ธŽ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font that can distinguish the two styles, the following might be the emoji version: โ˜”๏ธ‰ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[...\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-8 encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

Decoded barcode extra digits

I am trying to come to terms with how a barcode is decoded and generated by a scanner.
A note from the client says the following generated bar code consists of extra characters:
Generated Code: |2389299920014}
Extra Characters: Apparently the first two and last three characters are not part of the bar code.
Question
Are the extra characters attached by the bar code reader (therefore dependent on the scanner) or are they an intrinsic part of the barcode?
Here is a sample image of a barcode:
http://imageshack.us/a/img824/1862/dm6x.jpg
Thanks
[SOLVED] My apologies. This was just another one of those cases of 'shooting your mouth off' without doing proper research.
Solution: The code is EAN-13. The prefix and suffix are probably scanner-dependent. The 13 digits in between are, from the left:
First digit: check sum
Next 9 digits: company ID + item ID
Last 3 digits: GS1 prefix
It's hard to answer without understanding what format you are trying to encode, what the intended contents are, and what the purported contents are.
Some formats add extra information as part of the encoding process, but it does not become part of the content. When correctly encoded and decoded, the output should match the input exactly.
Barcodes encode exactly what they encode; there is no data that is somehow part of the barcode but not encoded in it.
EAN-13 has no scanner-dependent considerations, no. The encoding and decoding of a given number is the same everywhere. EAN-13 encodes 13 digits, so I am not sure what the 13 digits "in between" mean.
You mention GS1, which is something else. A family of barcodes in fact. You'd have to say what specifically you are using. The GS1 encodings are likewise not ambiguous or scanner-dependent. You know what you want to encode, you encode it exactly, it's read exactly.
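As a side note on the EAN-13 reading: the check digit is computable, so the 13 digits between the scanner's prefix and suffix can be verified in code. A minimal sketch using the standard modulo-10 weighting (independent of any scanner; valid_ean13? is a name chosen here for illustration):

```ruby
# Validate an EAN-13: the 13th digit is a checksum over the first 12,
# with weights 1 and 3 alternating from the left.
def valid_ean13?(code)
  return false unless code.match?(/\A\d{13}\z/)
  digits = code.chars.map(&:to_i)
  check  = digits.pop
  sum    = digits.each_with_index.sum { |d, i| d * (i.even? ? 1 : 3) }
  (10 - sum % 10) % 10 == check
end

valid_ean13?("2389299920014")  # the 13 digits from the question => true
valid_ean13?("2389299920015")  # altered last digit => false
```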

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('ยต')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it (say, the currency symbols โ‚ฌยฃยฅ$ plus a trailing newline), running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that โ‚ฌ is U+20AC, ยฃ is U+00A3, and ยฅ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information if you actually need it. Or just add leading zeros to the hex values from an ASCII table, or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, it may prove unwieldy: it may be hard to even find the escape sequences in files with mostly ASCII text, or to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError # but if that fails
    encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
โ‚ฌ encodes to "\u20AC".
ยฃ encodes to "\u00A3".
ยฅ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of ๐Ÿ™‹๐Ÿพ ัž ัƒฬ†, the output would be:
๐Ÿ™‹ encodes to "\u{1F64B}".
๐Ÿพ encodes to "\u{1F3FE}".
ัž encodes to "\u045E".
ัƒ encodes to "\u0443".
ฬ† encodes to "\u0306".
This is because ๐Ÿ™‹๐Ÿพ is actually encoded as two code points: a base character (๐Ÿ™‹, U+1F64B) with a skin-tone modifier (๐Ÿพ, U+1F3FE). Similarly with one of the letters: the first, ัž, is a single pre-combined code point (U+045E), while the second, ัƒฬ†, though it looks the same, is formed by combining ัƒ (U+0443) with the modifier ฬ† (U+0306, which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
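As an aside (not part of the dumping technique above): if you need the pre-combined and combining forms to compare equal, Ruby's built-in String#unicode_normalize (2.2+) converts between them:

```ruby
precomposed = "\u045E"        # a single pre-combined code point
combining   = "\u0443\u0306"  # base letter plus combining breve

precomposed == combining                          # => false
precomposed == combining.unicode_normalize(:nfc)  # => true
precomposed.unicode_normalize(:nfd) == combining  # => true
```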
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it and you want to find the code point value to encode, a web search for that character will frequently turn up a variety of pages that give Unicode details for it. For example, if I do a Google search for โœ“, I get, among other things, a Wiktionary entry, a Wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific Unicode characters. And each of those pages lists the fact that that check mark is represented by Unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages listing the code points. Which then brings us to putting that back into Ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any Unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for ๐Ÿ˜. This form can also be used to encode multiple code points in a single escape sequence, separating the characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character ๐Ÿ™‹ plus the modifier ๐Ÿพ, thus ultimately yielding the abstract character ๐Ÿ™‹๐Ÿพ (as seen above).
This works with shorter code points, too. For example, that currency character string from above (โ‚ฌยฃยฅ$) could be represented with \u{20AC A3 A5 24}, requiring only 2 digits for three of the characters.
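Both escape forms are easy to sanity-check by round-tripping through #unpack:

```ruby
s = "\u{20AC A3 A5 24}"  # one escape, four code points
s.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+20AC", "U+00A3", "U+00A5", "U+0024"]

"\u2713" == "\u{2713}"  # => true; both forms name the same code point
```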
You can use Unicode characters directly if you just add # encoding: UTF-8 at the top of your file. Then you can freely use รค, วน, รบ and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100ูช".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.
