Regular expression matching emoji in Mac OS X / iOS - ruby

Note: this question could look odd on systems not supporting the included emoji.
This is a follow-up question to How do I remove emoji from string.
I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.
The obvious Unicode blocks cover most, but not all of these emoji:
U+1F300..U+1F5FF Miscellaneous Symbols And Pictographs
U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols
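In Ruby (1.9+), these four contiguous blocks translate directly into a single character class. This is only a sketch of the block-based approach; it will not match the multi-code-point sequences (variation selectors, flags, keycaps) discussed below:

```ruby
# Character class covering the four Unicode blocks listed above.
EMOJI_BLOCKS = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{1F650}-\u{1F67F}\u{1F680}-\u{1F6FF}]/

EMOJI_BLOCKS.match?("\u{1F600}")   # grinning face => true
EMOJI_BLOCKS.match?("plain text")  # => false
```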
Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point: (slightly updated)
people = '๐Ÿ˜„๐Ÿ˜ƒ๐Ÿ˜€๐Ÿ˜Šโ˜บ๏ธ๐Ÿ˜‰๐Ÿ˜๐Ÿ˜˜๐Ÿ˜š๐Ÿ˜—๐Ÿ˜™๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜ณ๐Ÿ˜๐Ÿ˜”๐Ÿ˜Œ๐Ÿ˜’๐Ÿ˜ž๐Ÿ˜ฃ๐Ÿ˜ข๐Ÿ˜‚๐Ÿ˜ญ๐Ÿ˜ช๐Ÿ˜ฅ๐Ÿ˜ฐ๐Ÿ˜…๐Ÿ˜“๐Ÿ˜ฉ๐Ÿ˜ซ๐Ÿ˜จ๐Ÿ˜ฑ๐Ÿ˜ ๐Ÿ˜ก๐Ÿ˜ค๐Ÿ˜–๐Ÿ˜†๐Ÿ˜‹๐Ÿ˜ท๐Ÿ˜Ž๐Ÿ˜ด๐Ÿ˜ต๐Ÿ˜ฒ๐Ÿ˜Ÿ๐Ÿ˜ฆ๐Ÿ˜ง๐Ÿ˜ˆ๐Ÿ‘ฟ๐Ÿ˜ฎ๐Ÿ˜ฌ๐Ÿ˜๐Ÿ˜•๐Ÿ˜ฏ๐Ÿ˜ถ๐Ÿ˜‡๐Ÿ˜๐Ÿ˜‘๐Ÿ‘ฒ๐Ÿ‘ณ๐Ÿ‘ฎ๐Ÿ‘ท๐Ÿ’‚๐Ÿ‘ถ๐Ÿ‘ฆ๐Ÿ‘ง๐Ÿ‘จ๐Ÿ‘ฉ๐Ÿ‘ด๐Ÿ‘ต๐Ÿ‘ฑ๐Ÿ‘ผ๐Ÿ‘ธ๐Ÿ˜บ๐Ÿ˜ธ๐Ÿ˜ป๐Ÿ˜ฝ๐Ÿ˜ผ๐Ÿ™€๐Ÿ˜ฟ๐Ÿ˜น๐Ÿ˜พ๐Ÿ‘น๐Ÿ‘บ๐Ÿ™ˆ๐Ÿ™‰๐Ÿ™Š๐Ÿ’€๐Ÿ‘ฝ๐Ÿ’ฉ๐Ÿ”ฅโœจ๐ŸŒŸ๐Ÿ’ซ๐Ÿ’ฅ๐Ÿ’ข๐Ÿ’ฆ๐Ÿ’ง๐Ÿ’ค๐Ÿ’จ๐Ÿ‘‚๐Ÿ‘€๐Ÿ‘ƒ๐Ÿ‘…๐Ÿ‘„๐Ÿ‘๐Ÿ‘Ž๐Ÿ‘Œ๐Ÿ‘ŠโœŠโœŒ๐Ÿ‘‹โœ‹๐Ÿ‘๐Ÿ‘†๐Ÿ‘‡๐Ÿ‘‰๐Ÿ‘ˆ๐Ÿ™Œ๐Ÿ™โ˜๐Ÿ‘๐Ÿ’ช๐Ÿšถ๐Ÿƒ๐Ÿ’ƒ๐Ÿ‘ซ๐Ÿ‘ช๐Ÿ‘ฌ๐Ÿ‘ญ๐Ÿ’๐Ÿ’‘๐Ÿ‘ฏ๐Ÿ™†๐Ÿ™…๐Ÿ’๐Ÿ™‹๐Ÿ’†๐Ÿ’‡๐Ÿ’…๐Ÿ‘ฐ๐Ÿ™Ž๐Ÿ™๐Ÿ™‡๐ŸŽฉ๐Ÿ‘‘๐Ÿ‘’๐Ÿ‘Ÿ๐Ÿ‘ž๐Ÿ‘ก๐Ÿ‘ ๐Ÿ‘ข๐Ÿ‘•๐Ÿ‘”๐Ÿ‘š๐Ÿ‘—๐ŸŽฝ๐Ÿ‘–๐Ÿ‘˜๐Ÿ‘™๐Ÿ’ผ๐Ÿ‘œ๐Ÿ‘๐Ÿ‘›๐Ÿ‘“๐ŸŽ€๐ŸŒ‚๐Ÿ’„๐Ÿ’›๐Ÿ’™๐Ÿ’œ๐Ÿ’šโค๐Ÿ’”๐Ÿ’—๐Ÿ’“๐Ÿ’•๐Ÿ’–๐Ÿ’ž๐Ÿ’˜๐Ÿ’Œ๐Ÿ’‹๐Ÿ’๐Ÿ’Ž๐Ÿ‘ค๐Ÿ‘ฅ๐Ÿ’ฌ๐Ÿ‘ฃ๐Ÿ’ญ'
nature = '๐Ÿถ๐Ÿบ๐Ÿฑ๐Ÿญ๐Ÿน๐Ÿฐ๐Ÿธ๐Ÿฏ๐Ÿจ๐Ÿป๐Ÿท๐Ÿฝ๐Ÿฎ๐Ÿ—๐Ÿต๐Ÿ’๐Ÿด๐Ÿ‘๐Ÿ˜๐Ÿผ๐Ÿง๐Ÿฆ๐Ÿค๐Ÿฅ๐Ÿฃ๐Ÿ”๐Ÿ๐Ÿข๐Ÿ›๐Ÿ๐Ÿœ๐Ÿž๐ŸŒ๐Ÿ™๐Ÿš๐Ÿ ๐ŸŸ๐Ÿฌ๐Ÿณ๐Ÿ‹๐Ÿ„๐Ÿ๐Ÿ€๐Ÿƒ๐Ÿ…๐Ÿ‡๐Ÿ‰๐ŸŽ๐Ÿ๐Ÿ“๐Ÿ•๐Ÿ–๐Ÿ๐Ÿ‚๐Ÿฒ๐Ÿก๐ŸŠ๐Ÿซ๐Ÿช๐Ÿ†๐Ÿˆ๐Ÿฉ๐Ÿพ๐Ÿ’๐ŸŒธ๐ŸŒท๐Ÿ€๐ŸŒน๐ŸŒป๐ŸŒบ๐Ÿ๐Ÿƒ๐Ÿ‚๐ŸŒฟ๐ŸŒพ๐Ÿ„๐ŸŒต๐ŸŒด๐ŸŒฒ๐ŸŒณ๐ŸŒฐ๐ŸŒฑ๐ŸŒผ๐ŸŒ๐ŸŒž๐ŸŒ๐ŸŒš๐ŸŒ‘๐ŸŒ’๐ŸŒ“๐ŸŒ”๐ŸŒ•๐ŸŒ–๐ŸŒ—๐ŸŒ˜๐ŸŒœ๐ŸŒ›๐ŸŒ™๐ŸŒ๐ŸŒŽ๐ŸŒ๐ŸŒ‹๐ŸŒŒ๐ŸŒ โญโ˜€โ›…โ˜โšกโ˜”โ„โ›„๐ŸŒ€๐ŸŒ๐ŸŒˆ๐ŸŒŠ'
objects = '๐ŸŽ๐Ÿ’๐ŸŽŽ๐ŸŽ’๐ŸŽ“๐ŸŽ๐ŸŽ†๐ŸŽ‡๐ŸŽ๐ŸŽ‘๐ŸŽƒ๐Ÿ‘ป๐ŸŽ…๐ŸŽ„๐ŸŽ๐ŸŽ‹๐ŸŽ‰๐ŸŽŠ๐ŸŽˆ๐ŸŽŒ๐Ÿ”ฎ๐ŸŽฅ๐Ÿ“ท๐Ÿ“น๐Ÿ“ผ๐Ÿ’ฟ๐Ÿ“€๐Ÿ’ฝ๐Ÿ’พ๐Ÿ’ป๐Ÿ“ฑโ˜Ž๐Ÿ“ž๐Ÿ“Ÿ๐Ÿ“ ๐Ÿ“ก๐Ÿ“บ๐Ÿ“ป๐Ÿ”Š๐Ÿ”‰๐Ÿ”ˆ๐Ÿ”‡๐Ÿ””๐Ÿ”•๐Ÿ“ข๐Ÿ“ฃโณโŒ›โฐโŒš๐Ÿ”“๐Ÿ”’๐Ÿ”๐Ÿ”๐Ÿ”‘๐Ÿ”Ž๐Ÿ’ก๐Ÿ”ฆ๐Ÿ”†๐Ÿ”…๐Ÿ”Œ๐Ÿ”‹๐Ÿ”๐Ÿ›๐Ÿ›€๐Ÿšฟ๐Ÿšฝ๐Ÿ”ง๐Ÿ”ฉ๐Ÿ”จ๐Ÿšช๐Ÿšฌ๐Ÿ’ฃ๐Ÿ”ซ๐Ÿ”ช๐Ÿ’Š๐Ÿ’‰๐Ÿ’ฐ๐Ÿ’ด๐Ÿ’ต๐Ÿ’ท๐Ÿ’ถ๐Ÿ’ณ๐Ÿ’ธ๐Ÿ“ฒ๐Ÿ“ง๐Ÿ“ฅ๐Ÿ“คโœ‰๐Ÿ“ฉ๐Ÿ“จ๐Ÿ“ฏ๐Ÿ“ซ๐Ÿ“ช๐Ÿ“ฌ๐Ÿ“ญ๐Ÿ“ฎ๐Ÿ“ฆ๐Ÿ“๐Ÿ“„๐Ÿ“ƒ๐Ÿ“‘๐Ÿ“Š๐Ÿ“ˆ๐Ÿ“‰๐Ÿ“œ๐Ÿ“‹๐Ÿ“…๐Ÿ“†๐Ÿ“‡๐Ÿ“๐Ÿ“‚โœ‚๐Ÿ“Œ๐Ÿ“Žโœ’โœ๐Ÿ“๐Ÿ“๐Ÿ“•๐Ÿ“—๐Ÿ“˜๐Ÿ“™๐Ÿ““๐Ÿ“”๐Ÿ“’๐Ÿ“š๐Ÿ“–๐Ÿ”–๐Ÿ“›๐Ÿ”ฌ๐Ÿ”ญ๐Ÿ“ฐ๐ŸŽจ๐ŸŽฌ๐ŸŽค๐ŸŽง๐ŸŽผ๐ŸŽต๐ŸŽถ๐ŸŽน๐ŸŽป๐ŸŽบ๐ŸŽท๐ŸŽธ๐Ÿ‘พ๐ŸŽฎ๐Ÿƒ๐ŸŽด๐Ÿ€„๐ŸŽฒ๐ŸŽฏ๐Ÿˆ๐Ÿ€โšฝโšพ๐ŸŽพ๐ŸŽฑ๐Ÿ‰๐ŸŽณโ›ณ๐Ÿšต๐Ÿšด๐Ÿ๐Ÿ‡๐Ÿ†๐ŸŽฟ๐Ÿ‚๐ŸŠ๐Ÿ„๐ŸŽฃโ˜•๐Ÿต๐Ÿถ๐Ÿผ๐Ÿบ๐Ÿป๐Ÿธ๐Ÿน๐Ÿท๐Ÿด๐Ÿ•๐Ÿ”๐ŸŸ๐Ÿ—๐Ÿ–๐Ÿ๐Ÿ›๐Ÿค๐Ÿฑ๐Ÿฃ๐Ÿฅ๐Ÿ™๐Ÿ˜๐Ÿš๐Ÿœ๐Ÿฒ๐Ÿข๐Ÿก๐Ÿณ๐Ÿž๐Ÿฉ๐Ÿฎ๐Ÿฆ๐Ÿจ๐Ÿง๐ŸŽ‚๐Ÿฐ๐Ÿช๐Ÿซ๐Ÿฌ๐Ÿญ๐Ÿฏ๐ŸŽ๐Ÿ๐ŸŠ๐Ÿ‹๐Ÿ’๐Ÿ‡๐Ÿ‰๐Ÿ“๐Ÿ‘๐Ÿˆ๐ŸŒ๐Ÿ๐Ÿ๐Ÿ ๐Ÿ†๐Ÿ…๐ŸŒฝ'
places = '๐Ÿ ๐Ÿก๐Ÿซ๐Ÿข๐Ÿฃ๐Ÿฅ๐Ÿฆ๐Ÿช๐Ÿฉ๐Ÿจ๐Ÿ’’โ›ช๐Ÿฌ๐Ÿค๐ŸŒ‡๐ŸŒ†๐Ÿฏ๐Ÿฐโ›บ๐Ÿญ๐Ÿ—ผ๐Ÿ—พ๐Ÿ—ป๐ŸŒ„๐ŸŒ…๐ŸŒƒ๐Ÿ—ฝ๐ŸŒ‰๐ŸŽ ๐ŸŽกโ›ฒ๐ŸŽข๐Ÿšขโ›ต๐Ÿšค๐Ÿšฃโš“๐Ÿš€โœˆ๐Ÿ’บ๐Ÿš๐Ÿš‚๐ŸšŠ๐Ÿš‰๐Ÿšž๐Ÿš†๐Ÿš„๐Ÿš…๐Ÿšˆ๐Ÿš‡๐Ÿš๐Ÿš‹๐Ÿšƒ๐ŸšŽ๐ŸšŒ๐Ÿš๐Ÿš™๐Ÿš˜๐Ÿš—๐Ÿš•๐Ÿš–๐Ÿš›๐Ÿšš๐Ÿšจ๐Ÿš“๐Ÿš”๐Ÿš’๐Ÿš‘๐Ÿš๐Ÿšฒ๐Ÿšก๐ŸšŸ๐Ÿš ๐Ÿšœ๐Ÿ’ˆ๐Ÿš๐ŸŽซ๐Ÿšฆ๐Ÿšฅโš ๐Ÿšง๐Ÿ”ฐโ›ฝ๐Ÿฎ๐ŸŽฐโ™จ๐Ÿ—ฟ๐ŸŽช๐ŸŽญ๐Ÿ“๐Ÿšฉ๐Ÿ‡ฏ๐Ÿ‡ต๐Ÿ‡ฐ๐Ÿ‡ท๐Ÿ‡ฉ๐Ÿ‡ช๐Ÿ‡จ๐Ÿ‡ณ๐Ÿ‡บ๐Ÿ‡ธ๐Ÿ‡ซ๐Ÿ‡ท๐Ÿ‡ช๐Ÿ‡ธ๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ‡ท๐Ÿ‡บ๐Ÿ‡ฌ๐Ÿ‡ง'
symbols = '1๏ธโƒฃ2๏ธโƒฃ3๏ธโƒฃ4๏ธโƒฃ5๏ธโƒฃ6๏ธโƒฃ7๏ธโƒฃ8๏ธโƒฃ9๏ธโƒฃ0๏ธโƒฃ๐Ÿ”Ÿ๐Ÿ”ข#๏ธโƒฃ๐Ÿ”ฃโฌ†๏ธโฌ‡๏ธโฌ…๏ธโžก๏ธ๐Ÿ” ๐Ÿ”ก๐Ÿ”คโ†—๏ธโ†–๏ธโ†˜๏ธโ†™๏ธโ†”๏ธโ†•๏ธ๐Ÿ”„โ—€๏ธโ–ถ๏ธ๐Ÿ”ผ๐Ÿ”ฝโ†ฉ๏ธโ†ช๏ธโ„น๏ธโชโฉโซโฌโคต๏ธโคด๏ธ๐Ÿ†—๐Ÿ”€๐Ÿ”๐Ÿ”‚๐Ÿ†•๐Ÿ†™๐Ÿ†’๐Ÿ†“๐Ÿ†–๐Ÿ“ถ๐ŸŽฆ๐Ÿˆ๐Ÿˆฏ๐Ÿˆณ๐Ÿˆต๐Ÿˆด๐Ÿˆฒ๐Ÿ‰๐Ÿˆน๐Ÿˆบ๐Ÿˆถ๐Ÿˆš๐Ÿšป๐Ÿšน๐Ÿšบ๐Ÿšผ๐Ÿšพ๐Ÿšฐ๐Ÿšฎ๐Ÿ…ฟ๏ธโ™ฟ๏ธ๐Ÿšญ๐Ÿˆท๐Ÿˆธ๐Ÿˆ‚โ“‚๏ธ๐Ÿ›‚๐Ÿ›„๐Ÿ›…๐Ÿ›ƒ๐Ÿ‰‘ใŠ™๏ธใŠ—๏ธ๐Ÿ†‘๐Ÿ†˜๐Ÿ†”๐Ÿšซ๐Ÿ”ž๐Ÿ“ต๐Ÿšฏ๐Ÿšฑ๐Ÿšณ๐Ÿšท๐Ÿšธโ›”โœณ๏ธโ‡๏ธโŽโœ…โœด๏ธ๐Ÿ’Ÿ๐Ÿ†š๐Ÿ“ณ๐Ÿ“ด๐Ÿ…ฐ๐Ÿ…ฑ๐Ÿ†Ž๐Ÿ…พ๐Ÿ’ โžฟโ™ป๏ธโ™ˆ๏ธโ™‰๏ธโ™Š๏ธโ™‹๏ธโ™Œ๏ธโ™๏ธโ™Ž๏ธโ™๏ธโ™๏ธโ™‘๏ธโ™’๏ธโ™“๏ธโ›Ž๐Ÿ”ฏ๐Ÿง๐Ÿ’น๐Ÿ’ฒ๐Ÿ’ฑยฉ๏ธยฎ๏ธโ„ข๏ธโŒโ€ผ๏ธโ‰๏ธโ—โ“โ•โ”โญ•๐Ÿ”๐Ÿ”š๐Ÿ”™๐Ÿ”›๐Ÿ”œ๐Ÿ”ƒ๐Ÿ•›๐Ÿ•ง๐Ÿ•๐Ÿ•œ๐Ÿ•‘๐Ÿ•๐Ÿ•’๐Ÿ•ž๐Ÿ•“๐Ÿ•Ÿ๐Ÿ•”๐Ÿ• ๐Ÿ••๐Ÿ•–๐Ÿ•—๐Ÿ•˜๐Ÿ•™๐Ÿ•š๐Ÿ•ก๐Ÿ•ข๐Ÿ•ฃ๐Ÿ•ค๐Ÿ•ฅ๐Ÿ•ฆโœ–๏ธโž•โž–โž—โ™ โ™ฅโ™ฃโ™ฆ๐Ÿ’ฎ๐Ÿ’ฏโœ”โ˜‘๐Ÿ”˜๐Ÿ”—โžฐใ€ฐใ€ฝ๏ธ๐Ÿ”ฑโ—ผ๏ธโ—ป๏ธโ—พ๏ธโ—ฝ๏ธโ–ช๏ธโ–ซ๏ธ๐Ÿ”บ๐Ÿ”ฒ๐Ÿ”ณโšซ๏ธโšช๏ธ๐Ÿ”ด๐Ÿ”ต๐Ÿ”ปโฌœ๏ธโฌ›๏ธ๐Ÿ”ถ๐Ÿ”ท๐Ÿ”ธ๐Ÿ”น'
emoji = people + nature + objects + places + symbols # all emoji combined
Most characters have a single code point and converting these would be easy:
๐Ÿ˜€ U+1F600 (Grinning Face)
But some characters are "encoded using two Unicode values":
โ˜บ๏ธ U+263A U+FE0F (White Smiling Face, Variation Selector 16)
๐Ÿ‡ฏ๐Ÿ‡ต U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)
โฌ›๏ธ U+2B1B U+FE0F (Black Large Square / Variation Selector 16)
And some even have 3 codepoints:
๏ธโƒฃ U+0023 U+FE0F U+20E3 (Number Sign / Variation Selector 16 / Combining Enclosing Keycap)
(Variation Selector 16 means "emoji style")
How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?
The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.
(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)
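A sketch of the splitting step (Ruby 2.0+): the \X regex escape matches one extended grapheme cluster, so combined sequences like the flag and variation-selector characters stay in one piece, and String#unpack('U*') then yields the code points:

```ruby
# Grinning face, smiling face + VS16, and the two regional indicators
# that form the Japanese flag.
emoji = "\u{1F600}\u{263A FE0F}\u{1F1EF 1F1F5}"

# \X matches one extended grapheme cluster, so multi-code-point
# emoji are not split apart.
clusters = emoji.scan(/\X/)

# Code points of each cluster, formatted as U+XXXX:
clusters.map { |c| c.unpack('U*').map { |cp| format('U+%04X', cp) }.join(' ') }
# => ["U+1F600", "U+263A U+FE0F", "U+1F1EF U+1F1F5"]

# A literal alternation matching exactly these sequences:
pattern = Regexp.union(clusters)
pattern.match?("text with \u{263A FE0F} inside")  # => true
```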

The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.
By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it's trivial to turn it into a regular expression.
Here's a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js And here's the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
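The parsing step is straightforward in Ruby, too. A minimal sketch, assuming lines of the form "XXXX ; property" or "XXXX..YYYY ; property" (the sample data below is illustrative; the draft file's exact columns may differ, so feed it the real file):

```ruby
# Illustrative emoji-data.txt-style lines:
sample = <<~DATA
  231A..231B    ; Emoji                # WATCH..HOURGLASS
  1F600         ; Emoji                # GRINNING FACE
DATA

# Collect each code point or range as a Ruby Range of integers.
ranges = sample.each_line.map { |line|
  next unless line =~ /\A([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;/
  first = Regexp.last_match(1).to_i(16)
  last  = (Regexp.last_match(2) || Regexp.last_match(1)).to_i(16)
  first..last
}.compact

# Build a single character class from the ranges.
body = ranges.map { |r| format('\u{%X}-\u{%X}', r.first, r.last) }.join
emoji_re = Regexp.new("[#{body}]")

emoji_re.match?("\u{1F600}")  # => true
emoji_re.match?("a")          # => false
```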

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
Examples can be found here: https://stackoverflow.com/a/29115920/1911674
EDIT: I updated the regex to exclude ASCII numbers and symbols. See the comments on How do I remove emoji from string for details.

Related

DT_WORDBREAK: list of word break symbols

I use DT_WORDBREAK flag when I call DrawTextEx. About this flag MSDN says:
Lines are automatically broken between words if a word extends past
the edge of the rectangle specified by the lprc parameter. A carriage
return-line feed sequence also breaks the line.
But I cannot find an "official" list of the symbols that are used as word-break symbols. Does it exist?
If you get the TEXTMETRICs for the font you're using, it corresponds to the tmBreakChar field.
For any Latin font, this is almost certainly just the plain old space character (Unicode U+0020 SPACE or ASCII 32).
I don't think DrawTextEx does anything fancier. You'd have to use a more advanced API to get more sophisticated behavior such as breaking after hyphens, soft-hyphens, other kinds of spaces, etc.

Why is 'ะฐnd' == 'and' false?

I tagged character-encoding and text because I know that if you type 'and' == 'and' into the rails console, or almost any other programming language, you will get true. However, when one of my users pastes his text into my website, I can't spell-check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding?).
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'ะฐnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'ะฐnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn't really an encoding problem; in the first case the strings compare as false simply because they are different.
The first character of the first string isn't a "normal" a: it is actually U+0430 CYRILLIC SMALL LETTER A. The first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a "normal" Latin a, which is U+0061 LATIN SMALL LETTER A.
Here's the "normal" a: a, and this is the Cyrillic a: ะฐ; they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that there are many other characters that look similar to the more familiar ones. If you choose this route you would be better off looking for a library/gem that does it for you, and you might find you're being too strict about conversions.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby's regular expression script support, e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
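For example, a sketch of the script-based check (the \u0430 below is the Cyrillic a from the question; latin_only? is a hypothetical helper name):

```ruby
suspect = "\u0430nd"  # Cyrillic a + Latin "nd", as pasted by the user
plain   = "and"       # all Latin

suspect == plain                # => false
suspect.match?(/\p{Cyrillic}/)  # => true
plain.match?(/\p{Cyrillic}/)    # => false

# Or whitelist scripts: reject anything outside Latin plus the
# "Common" script (digits, spaces, punctuation).
def latin_only?(str)
  !str.match?(/[^\p{Latin}\p{Common}]/)
end

latin_only?(plain)    # => true
latin_only?(suspect)  # => false
```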
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
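Both breakdowns are easy to reproduce in Ruby: String#unpack('U*') yields code points rather than raw bytes.

```ruby
copied = "\u0430nd"  # the pasted string, Cyrillic a first
typed  = "and"

copied.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+0430", "U+006E", "U+0064"]
typed.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+0061", "U+006E", "U+0064"]
```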
Really, there is no problem at all! The two strings are different. With the font you are using, a cyrillic a looks the same as a latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that cyrillic a come from?
Maybe it was meant to be a cyrillic a and really should be treated not-equal to a latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why they always display the Punycode domain in addition to the Unicode domain name.
I think it is an encoding issue; you can see it like this:
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'ะฐnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "ะฐnd"

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]โ˜”๏ธŽ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
โ˜‚โ˜ƒโ˜ฝโ˜€๏ธŽโ˜”๏ธŽ
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed โ˜”๏ธŽ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font that can distinguish the two styles, the following might be the emoji version: โ˜”๏ธ‰ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[...\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-8 encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

Decoded barcode extra digits

I am trying to come to terms with how a barcode is decoded and generated by a scanner.
A note from the client says the following generated bar code consists of extra characters:
Generated Code: |2389299920014}
Extra Characters: Apparently the first two and last three characters are not part of the bar code.
Question
Are the extra characters attached by the bar code reader (therefore dependent on the scanner) or are they an intrinsic part of the barcode?
Here is a sample image of a barcode:
http://imageshack.us/a/img824/1862/dm6x.jpg
Thanks
[SOLVED] My apologies. This was just another one of those cases of 'shooting your mouth off' without doing proper research.
Solution: The code is EAN-13. The prefix and suffix are probably scanner-dependent. The 13 digits in between are, from the left:
First digit: check sum
Next 9 digits: company ID + item ID
Last 3 digits: GS1 prefix
It's hard to answer without understanding what format you are trying to encode, what the intended contents are, and what the purported contents are.
Some formats add extra information as part of the encoding process, but it does not become part of the content. When correctly encoded and decoded, the output should match the input exactly.
Barcodes encode exactly what they encode; there is no data that is somehow part of the barcode but not encoded in it.
EAN-13 has no scanner-dependent considerations, no. The encoding and decoding of a given number is the same everywhere. EAN-13 encodes 13 digits, so I am not sure what the 13 digits "in between" mean.
You mention GS1, which is something else. A family of barcodes in fact. You'd have to say what specifically you are using. The GS1 encodings are likewise not ambiguous or scanner-dependent. You know what you want to encode, you encode it exactly, it's read exactly.
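As a side note on the EAN-13 reading: the check digit is computable, so the 13 digits between the scanner's prefix and suffix can be verified in code. A minimal sketch using the standard modulo-10 weighting (independent of any scanner; valid_ean13? is a name chosen here for illustration):

```ruby
# Validate an EAN-13: the 13th digit is a checksum over the first 12,
# with weights 1 and 3 alternating from the left.
def valid_ean13?(code)
  return false unless code.match?(/\A\d{13}\z/)
  digits = code.chars.map(&:to_i)
  check  = digits.pop
  sum    = digits.each_with_index.sum { |d, i| d * (i.even? ? 1 : 3) }
  (10 - sum % 10) % 10 == check
end

valid_ean13?("2389299920014")  # the 13 digits from the question => true
valid_ean13?("2389299920015")  # altered last digit => false
```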

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('ยต')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it (say, the currency symbols โ‚ฌยฃยฅ$ plus a trailing newline), running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that โ‚ฌ is U+20AC, ยฃ is U+00A3, and ยฅ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information if you actually need it. Or just add leading zeros to the hex values from an ASCII table, or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, it may prove unwieldy: it may be hard to even find the escape sequences in files with mostly ASCII text, or to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError # but if that fails
    encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
โ‚ฌ encodes to "\u20AC".
ยฃ encodes to "\u00A3".
ยฅ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of ๐Ÿ™‹๐Ÿพ ัž ัƒฬ†, the output would be:
๐Ÿ™‹ encodes to "\u{1F64B}".
๐Ÿพ encodes to "\u{1F3FE}".
ัž encodes to "\u045E".
ัƒ encodes to "\u0443".
ฬ† encodes to "\u0306".
This is because ๐Ÿ™‹๐Ÿพ is actually encoded as two code points: a base character (๐Ÿ™‹, U+1F64B) with a skin-tone modifier (๐Ÿพ, U+1F3FE). Similarly with one of the letters: the first, ัž, is a single pre-combined code point (U+045E), while the second, ัƒฬ†, though it looks the same, is formed by combining ัƒ (U+0443) with the modifier ฬ† (U+0306, which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
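As an aside (not part of the dumping technique above): if you need the pre-combined and combining forms to compare equal, Ruby's built-in String#unicode_normalize (2.2+) converts between them:

```ruby
precomposed = "\u045E"        # a single pre-combined code point
combining   = "\u0443\u0306"  # base letter plus combining breve

precomposed == combining                          # => false
precomposed == combining.unicode_normalize(:nfc)  # => true
precomposed.unicode_normalize(:nfd) == combining  # => true
```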
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it and you want to find the code point value to encode, a web search for that character will frequently turn up a variety of pages that give Unicode details for it. For example, if I do a Google search for โœ“, I get, among other things, a Wiktionary entry, a Wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific Unicode characters. And each of those pages lists the fact that that check mark is represented by Unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages listing the code points. Which then brings us to putting that back into Ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any Unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for ๐Ÿ˜. This form can also be used to encode multiple code points in a single escape sequence, separating the characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character ๐Ÿ™‹ plus the modifier ๐Ÿพ, thus ultimately yielding the abstract character ๐Ÿ™‹๐Ÿพ (as seen above).
This works with shorter code points, too. For example, that currency character string from above (โ‚ฌยฃยฅ$) could be represented with \u{20AC A3 A5 24}, requiring only 2 digits for three of the characters.
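Both escape forms are easy to sanity-check by round-tripping through #unpack:

```ruby
s = "\u{20AC A3 A5 24}"  # one escape, four code points
s.unpack('U*').map { |cp| format('U+%04X', cp) }
# => ["U+20AC", "U+00A3", "U+00A5", "U+0024"]

"\u2713" == "\u{2713}"  # => true; both forms name the same code point
```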
You can use Unicode characters directly if you just add # encoding: UTF-8 at the top of your file. Then you can freely use รค, วน, รบ and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100ูช".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.
