UTF-8 characters with Adafruit_GFX - utf-8

I am using an Arduino with the OPEN-SMART Touch Screen Expansion Shield, which uses the Adafruit_GFX library. I need to print arbitrary UTF-8 characters, for example letters with diacritics, Greek letters and so on. If I try to print these characters with the default font, it prints nonsense. What should I do?

Related

How can I print a unicode character from its name?

In Python, I can use \N to print unicode characters.
print('\N{White Smiling Face}')
Will print ☺
Is there a way to do the same thing in Go? I couldn't find anything in unicode or x/text.

How to print a tilde (~) in Zebra Programming Language (ZPL)

I am maintaining a program that outputs ZPL to a label printer. Today, the character sequence ~Ja came in as part of a string to be printed, which is ZPL's "cancel all" command. Needless to say, the label did not print.
Is there an easy way in ZPL to escape a tilde?
You can use ~CT or ^CT to change the tilde control character to any other ASCII character, and then you can print tildes normally. However, the new control character won't be printable. This is probably going to be quite a hassle to maintain.
An example changing the control command prefix to +, taken from page 165 of the ZPL II programming guide:
^XA
^CT+
^XZ
+HS
If your string is represented as field data with ^FD, ^FV, or ^SN, you can use ^FH to encode the tilde in the string with its hex value, 7E.
An example, taken from page 192 of the ZPL II programming guide:
^XA
^FO100,100
^AD^FH
^FDTilde _7e used for HEX^FS
^XZ
Output:
Tilde ~ used for HEX
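As a rough sketch of the ^FH approach, the helper below hex-encodes problem characters before they are placed in ^FD field data. It is written in Ruby purely for illustration. The name fh_escape is hypothetical, and encoding the underscore itself (the hex indicator used in the example above) is an assumption that the guide excerpt does not state.

```ruby
# Hypothetical helper: hex-encode characters that ZPL would otherwise read as
# control prefixes, for field data that is preceded by ^FH.
# Assumption: "_" is the hex indicator, so a literal underscore is encoded too.
def fh_escape(text)
  text.gsub(/[_~^]/) { |ch| format('_%02x', ch.ord) }
end

puts "^FH^FD#{fh_escape('Tilde ~ used for HEX')}^FS"
# prints: ^FH^FDTilde _7e used for HEX^FS
```

Escaping at the point where the field is built keeps the rest of the label template unchanged.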
~ can be printed by replacing it with \7E.
It seems that replacing these three characters allows any key on the keyboard to print fine. I figured this out by using ZebraDesigner, printing to a file and seeing which characters it escapes.
\ to \1F - do this first or it will break the two below
~ to \7E
^ to \5E
Here is the code in C#
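private static string escapeChars(string working)
{
    // Escape the backslash first; otherwise the backslashes added by the
    // two replacements below would themselves be escaped.
    working = working.Replace(@"\", @"\1F");
    working = working.Replace(@"~", @"\7E");
    working = working.Replace(@"^", @"\5E");
    return working;
}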
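For comparison, here is a minimal Ruby sketch of the same three replacements, mainly to show why the order matters. The method name escape_zpl is hypothetical, and the mappings (\1F, \7E, \5E) are copied from the answer above rather than checked against a printer.

```ruby
# Hypothetical Ruby counterpart of escapeChars; mappings taken from the answer.
def escape_zpl(text)
  # The block form keeps each replacement literal. A plain '\\1F' replacement
  # string would be parsed by gsub as the backreference \1 followed by "F".
  escaped = text.gsub('\\') { '\\1F' }   # backslash first
  escaped = escaped.gsub('~') { '\\7E' } # then the characters that add backslashes
  escaped.gsub('^') { '\\5E' }
end

puts escape_zpl('50% ~Ja \\ ^XA')
# prints: 50% \7EJa \1F \5EXA
```

If the backslash were replaced last, the backslashes introduced for ~ and ^ would be rewritten as well.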

\w in Ruby Regular Expression matches Chinese characters

I use the code below:
puts "matched" if "中国" =~ /\w+/
It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z or _, so why does it output "matched"?
Could somebody give me some clues?
I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:
\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So \w does not necessarily mean just [a-zA-Z_0-9]: it (and other operators) behaves differently on Unicode strings than on ASCII ones.
This still makes it different from . though, as \w wouldn't match punctuation characters (sort of; see the \p{Lo} list below), spaces, new lines and various other non-word symbols.
As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:
\p{Ll} Lowercase Unicode letter
\p{Lu} Uppercase Unicode letter
\p{Lt} Titlecase Unicode letter
\p{Lo} Other Unicode letter
\p{Nd} Decimal, number
\p{Pc} "Punctuation, connector"
Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:
[\w] word character
Not Unicode:
* alphanumeric, "_" and multibyte char.
Unicode:
* General_Category -- (Letter|Mark|Number|Connector_Punctuation)
In 1.9+, Ruby knows if the string has Unicode characters, and automatically switches to use Unicode mode for pattern matching.

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
My approach was to match anything that is neither an ASCII character (00 - 7F) nor one of the special characters I want to keep, and replace it with an empty string.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
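The session above is from Ruby 1.8.7. As a hedged sketch of the same whitelist idea on Ruby 1.9 or later (an assumption, since the question targets 1.8), the pattern below keeps ASCII plus ó and í, the two characters named in the question's update, and removes everything else. The name not_allowed is illustrative.

```ruby
str = "subscripci\u00F3n gustar\u00EDa\u2665" # "subscripción gustaría♥"

# Negated class: anything that is not ASCII, not ó (U+00F3) and not í (U+00ED).
not_allowed = /[^\p{ASCII}\u00F3\u00ED]/

cleaned = str.gsub(not_allowed, '')
puts cleaned
# prints: subscripción gustaría
```

Writing the accented letters as \u escapes keeps the pattern independent of the source file's encoding.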
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. It says nothing about the encoding of the string. Also, do you want to whitelist characters or blacklist them? Which ones?
But you get the idea: decide what you want, and then use proper ranges as others here have already proposed. Some examples:
If str = "subscripción gustaría♥" is UTF-8, you can blacklist every character above the range (excluding whitespace):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
If the string is in the ISO-8859-1 code page, you can try to match all quirky characters like the "heart" from the beginning of the ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem here is with the regex and has nothing to do with Ruby. You will probably need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.

How to remove all non-ASCII characters from a string in Ruby

It seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. ©. See the following example.
#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode
output:
UTF-8
 Hello this a mixed string © that I made.
When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT
So my problem is that I want to get rid of all non ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
: "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitute all characters which don't have the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")
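To make the two suggestions concrete, here is a minimal sketch using the non-mutating forms (gsub and delete) on the question's string, assuming Ruby 1.9 or later. The third variant, String#encode with replacement options, is not from the answer above and is added only as a possible alternative. The variable names are illustrative.

```ruby
s = " Hello this a mixed string \u00A9 that I made."

by_regexp = s.gsub(/\P{ASCII}/, '')         # drop everything without the ASCII property
by_delete = s.delete("^\u{0000}-\u{007F}")  # delete everything outside U+0000..U+007F
# Alternative, not from the answer: transcode and replace what cannot be converted.
by_encode = s.encode('US-ASCII', invalid: :replace, undef: :replace, replace: '')

puts by_regexp
puts by_regexp == by_delete
puts by_regexp == by_encode
```

All three leave a double space where © used to be, so a follow-up squeeze(' ') may be wanted.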
Strip out the characters using regex. This example is in C# but the regex should be the same:
How can you strip non-ASCII characters from a string? (in C#)
Translating it into Ruby using gsub should not be difficult.
UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.
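The byte-level idea above can be sketched as follows. This is a minimal illustration, assuming the input is valid UTF-8. The method name strip_multibyte is hypothetical. It inspects each lead byte, skips the whole sequence when the high bit is set, and keeps single-byte (7-bit ASCII) values.

```ruby
# Hypothetical helper: keep only single-byte (ASCII) characters of a UTF-8 string.
def strip_multibyte(text)
  bytes = text.bytes.to_a
  kept = []
  i = 0
  while i < bytes.length
    b = bytes[i]
    if b < 0x80                # 0xxxxxxx: plain ASCII, keep it
      kept << b
      i += 1
    elsif b >> 5 == 0b110      # 110xxxxx: lead byte plus 1 trailer
      i += 2
    elsif b >> 4 == 0b1110     # 1110xxxx: lead byte plus 2 trailers
      i += 3
    elsif b >> 3 == 0b11110    # 11110xxx: lead byte plus 3 trailers
      i += 4
    else                       # stray continuation or invalid byte: drop it
      i += 1
    end
  end
  kept.pack('C*').force_encoding('US-ASCII')
end

puts strip_multibyte("a\u00A9b\u2665c\u{1F600}d")
# prints: abcd
```

Because every byte of a multi-byte sequence has its high bit set, simply dropping all bytes of 0x80 and above gives the same result for valid input.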

Resources