How to validate license plate characters/codepoints? - validation

Although license plates from my country only use [A-Z0-9], this is not true for international license plates. Since plates may be added from any country, I'd like to know the best way to validate the characters of a Unicode string that contains a license plate.
Should I just block all Unicode blocks and only allow a few, e.g. Basic Latin and Latin-1 Supplement, and then whitelist characters within them?

Latin is not enough: https://en.wikipedia.org/wiki/Vehicle_registration_plate
Letters, numbers, punctuation and separators seem like a good fit; the corresponding regex character class is [\pL\pN\pP\pZ].
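A minimal Ruby sketch of that character class, assuming Ruby 2.4 or later for String#match?. The names PLATE_PATTERN and plausible_plate? are made up for illustration, and the check only says that every character is a letter, number, punctuation mark or separator, not that the plate is valid in any particular country.

```ruby
# Hypothetical names; the pattern is the [\pL\pN\pP\pZ] class from above,
# written with Ruby's \p{..} syntax and anchored to the whole string.
PLATE_PATTERN = /\A[\p{L}\p{N}\p{P}\p{Z}]+\z/

def plausible_plate?(text)
  # Reject non-strings and broken encodings before running the regex.
  text.is_a?(String) && text.valid_encoding? && text.match?(PLATE_PATTERN)
end

puts plausible_plate?("AB-123-CD")   # Latin letters, digits, hyphen
puts plausible_plate?("М 123 АВ")    # Cyrillic letters and spaces
puts plausible_plate?("AB+123")      # "+" is a math symbol, not punctuation
```

Control characters such as line breaks fall outside all four categories, so they are rejected as well.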

Related

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How can I match UTF-8 characters as well?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers. More and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need by listing them.
But there is a shortcut. The necessary properties are already part of the Unicode data, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
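As a rough illustration of the difference, here is a small Ruby sketch that checks each line separately. The sample text and variable names are invented, and the pattern is a slight variant of the one above: it requires the line to start with an uppercase letter so that blank lines are not reported as headings.

```ruby
# Invented sample text: two headings, one of them with Slovenian letters.
text = <<~TEXT
  NASLOV ŠČŽ
  some text which is always lowercase
  YOU CAN HAVE MULTIPLE WORDS IN HEADING
TEXT

lines = text.each_line.map(&:chomp)

# [A-Z] only knows the 26 ASCII capitals.
ascii_headings = lines.select { |line| line.match?(/\A[A-Z][A-Z\s]*\z/) }

# [[:upper:]] also covers non-ASCII uppercase letters in Ruby.
unicode_headings = lines.select { |line| line.match?(/\A[[:upper:]][[:upper:]\s]*\z/) }

puts ascii_headings.inspect
puts unicode_headings.inspect
```

Checking line by line with \A and \z avoids the case where \s matches a line break and glues two lines together.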
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b
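A short sketch of \p{Lu} used with String#scan, assuming the text is UTF-8. The sample text is invented, and the pattern here is a simplified line-anchored variant (single spaces between words, one-letter words allowed) rather than the exact regex above.

```ruby
# Invented sample text.
text = "NASLOV ŠČŽ\nsome lowercase text\nI AM A HEADING\n"

# ^ and $ anchor at line boundaries in Ruby. The group is non-capturing,
# so scan returns the whole match for each heading.
heading_pattern = /^\p{Lu}+(?: \p{Lu}+)*$/

found = text.scan(heading_pattern)
puts found.inspect
```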

Which characters in UTF8 can have upper/lower case pair?

I know of a-z (A-Z), å (Å), ä (Ä), ö (Ö). But is there any official definition of which characters actually have a sibling in another case?
This is language-specific, but do check out the Case Mappings section, which is part of the Unicode Standard:
5.18 Case Mappings
Case is a normative property of characters in specific alphabets such as Latin, Greek, Cyrillic,
Armenian, and archaic Georgian, whereby characters are considered to be variants of a
single letter.
You may also want to check the European Alphabetic Scripts part for language specific information.
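For a quick practical check, a sketch like the following asks Ruby whether changing case alters a character. It assumes Ruby 2.4 or later, where String#upcase and String#downcase apply Unicode case mappings to non-ASCII text. The method name case_pair? is made up, and the result reflects whatever Unicode data the Ruby build ships with, not any language-specific rules.

```ruby
# Hypothetical helper: true when the character changes under upcase or downcase.
def case_pair?(char)
  char.upcase != char || char.downcase != char
end

%w[a Å ö Č 7 中].each do |char|
  puts "#{char}: #{case_pair?(char)}"
end
```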

Ads API: valid characters

I'm using the mini_fb gem in ruby to create an ad group:
response = fb_ads.create_ad_groups_with_image('adgroup_specs' => adgroup_specs)
If the ad text contains certain characters, such as ∑, this fails with the following error:
The text contains an invalid character. Ads may only contain alphanumeric characters, punctuation, and spaces. Note that line breaks and = are not allowed.
However, there are many other characters, such as π, ö é, î, ä, å, ç, è, and ø, that are accepted just fine. Is there a list somewhere of what characters Facebook accepts in its ads, or a quick API call that I can make to check whether a string will pass?
The Facebook Ads system allows ad titles and body text in most languages around the world. However, the symbol you've pasted in above is in a Unicode range dedicated to mathematical symbols, and it isn't allowed in the body or title of a Facebook ad. The character you entered (U+2211) has a good alternative in the Greek alphabet range of Unicode at U+03A3. Entering HTML entities is not going to render the way you want.
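As a rough local pre-check, the sketch below flags characters in the Unicode "math symbol" category and swaps the summation sign for the Greek capital sigma. This is only an assumption based on the error message and the explanation above; Facebook's actual filter is not documented here, and the helper names are invented.

```ruby
# Unicode general category Sm (math symbols); includes U+2211 and "=".
MATH_SYMBOL = /\p{Sm}/

# Hypothetical helper: lists the distinct math symbols found in the text.
def math_symbols(text)
  text.scan(MATH_SYMBOL).uniq
end

# Hypothetical helper: replaces N-ARY SUMMATION (U+2211) with
# GREEK CAPITAL LETTER SIGMA (U+03A3).
def swap_summation_for_sigma(text)
  text.tr("\u2211", "\u03A3")
end

ad_text = "\u2211 of all parts"
puts math_symbols(ad_text).inspect
puts swap_summation_for_sigma(ad_text)
```

Letters such as π, ö or ø are in letter categories, so this check leaves them alone, which is consistent with the behaviour described in the question.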
I don't have a link for you, but it is very likely that Facebook only supports/allows extended ASCII characters in their ads. That would include the characters you listed, but the "sum" character is not within extended ASCII. Have you tried using HTML entities for the "special" characters you need?

Convert unicode codepoint to string character in Ruby

I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to their readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
Notations such as U+2B71F are referred to as code points.
Unicode defines a unique code point for each character across a multitude of world languages, scientific symbols, currencies, etc. This character set is steadily growing.
For example, U+221E is infinity.
Code points are conventionally written as hexadecimal numbers. There is always exactly one number defined per character.
There are many ways to arrange these numbers in memory. Such an arrangement is known as an encoding; the common ones are UTF-8 and UTF-16. The conversion between them is well defined.
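To make the distinction concrete, here is a small sketch that takes the infinity sign (U+221E) and looks at its bytes in two encodings. The variable names are arbitrary.

```ruby
infinity = "\u221E"                        # one character, code point U+221E

utf8_bytes  = infinity.encode("UTF-8").bytes
utf16_bytes = infinity.encode("UTF-16BE").bytes

puts format("code point: U+%04X", infinity.ord)
puts "UTF-8 bytes:    " + utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts "UTF-16BE bytes: " + utf16_bytes.map { |b| format("%02X", b) }.join(" ")

# Converting back gives the same character again.
round_trip = infinity.encode("UTF-16BE").encode("UTF-8")
```

The code point stays the same; only the byte sequence used to store it differs between encodings.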
Here you most probably want to convert the Unicode code point to a UTF-8 encoded character.
codepoint = "U+2B71F"
You need to extract the hex part that comes after U+, so that only 2B71F remains. This will be the first capture group.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And your UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
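Putting the two steps together, a sketch for converting the whole list from the question could look like this. The method name codepoint_to_char is made up, and the stricter anchored pattern and the range check are additions for illustration, not part of the answer above.

```ruby
# Hypothetical helper: turns a "U+XXXX" notation into a UTF-8 string.
def codepoint_to_char(notation)
  match = /\AU\+([0-9A-Fa-f]{4,6})\z/.match(notation)
  raise ArgumentError, "not in U+XXXX form: #{notation.inspect}" unless match

  value = match[1].hex
  raise ArgumentError, "outside the Unicode range: #{notation}" if value > 0x10FFFF

  [value].pack("U")   # "U" packs a code point as a UTF-8 character
end

codepoints = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
characters = codepoints.map { |cp| codepoint_to_char(cp) }
puts characters.join(" ")
```

Whether the characters display correctly still depends on the terminal and on having a font that covers these code points.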
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

what kind of keyboard layout can type ISO 8859-1 Characters?

What kind of keyboard layout can type ISO 8859-1 characters?
Examples of what needs to be typed:
Ánam àbìa èbèa Ógbuá
First of all: keyboard layouts and character sets are not directly tied to each other. If I type Ü on my keyboard while in a UTF-8 application, the resulting character will be encoded as UTF-8. If I type it in an ISO-8859-1 application, it will be a character from that character set.
That said, there isn't a keyboard layout that covers all ISO-8859-1 characters; every country layout covers a part of them.
Full list of characters
According to Wikipedia, ISO-8859-1 covers the following languages' special characters in full:
Afrikaans, Albanian, Basque, Breton, Catalan, English (UK and US), Faroese, Galician, German, Icelandic, Irish (new orthography), Italian, Kurdish (The Kurdish Unified Alphabet), Latin (basic classical orthography), Leonese, Luxembourgish (basic classical orthography), Norwegian (Bokmål and Nynorsk), Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swahili, Swedish, Walloon
so you can safely assume that the keyboard layouts of those countries cover a part of ISO-8859-1.
This is what I have decided to do. I hope it puts somebody else on the right footing.
With special thanks to #Pekka for the patience, guidance and support.
// Replaces combination char with special chars
$phrase = "`U `are ^here tod`ay.";
$search = array("`U", "`a", "^h");
$replace = array("û", "ñ", "à");
$result = str_replace($search, $replace, $phrase);
This could be cleaner in a function, though.
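For comparison, the same idea wrapped in a function might look like the Ruby sketch below. The mapping table and the name expand_accents are invented for illustration (a backtick for a grave accent, an apostrophe for an acute accent), so adapt them to whatever key combinations you actually settle on.

```ruby
# Hypothetical mapping from two-character combinations to accented letters.
ACCENT_MAP = {
  "`a" => "à", "`e" => "è", "`i" => "ì",
  "'a" => "á", "'A" => "Á", "'O" => "Ó"
}.freeze

# Hypothetical helper: replaces every known combination in one pass.
def expand_accents(phrase)
  phrase.gsub(Regexp.union(ACCENT_MAP.keys), ACCENT_MAP)
end

puts expand_accents("'Anam `ab`ia `eb`ea 'Ogbu'a")
```

Using gsub with a hash keeps the combinations and their replacements side by side, so the two lists cannot drift out of step.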
