I need to create a validation rule that checks whether the entered text complies with national identification number standards (https://en.wikipedia.org/wiki/National_identification_number).
The following are allowed in an identification number: letters (including letters outside ASCII a-zA-Z), digits, spaces, hyphens, plus signs, equals signs and slashes.
Writing an individual restriction for every character would make the validation rule extremely long
...WHERE (National_ID LIKE "0##########") OR (National_ID LIKE "0############")
and I am wondering if there is a better way that would make more sense.
Also, Access does not support Regex, which complicates the task even further.
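For comparison only, here is a minimal sketch of the same character whitelist outside Access, written in Python. The function name is_valid_national_id and the decision to reject empty input are assumptions made for illustration, not part of the question.

```python
# Punctuation allowed by the rule in the question: space, hyphen, plus, equals, slash.
ALLOWED_PUNCTUATION = set(" -+=/")

def is_valid_national_id(text):
    """Return True if every character is a letter, a digit, or allowed punctuation.

    str.isalnum() also accepts letters outside ASCII a-zA-Z.
    Rejecting the empty string is an assumption of this sketch.
    """
    return bool(text) and all(
        ch.isalnum() or ch in ALLOWED_PUNCTUATION for ch in text
    )
```

The point of the sketch is that the rule is a per-character membership test, so it does not need one pattern per possible length.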
R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, vertical bars delimit a symbol, as in |..ident..|, so that it can contain any character that cannot appear in an old-style symbol.
However, in R6RS the "official" grammar was incorrect: it did not allow symbols such as 1+, which led implementations to define their own rules to work around this flaw in the official grammar.
Unless you need to read the source code of a given implementation and see how it defines the symbols, you should not care too much about these rules and use classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt that implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom to read up to the next delimiter, and then classify the atom by trying read-number on it; if number? fails, the atom is a symbol.
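The read-then-classify approach can be sketched in Python as follows. This is an illustration of how a lenient reader might behave, not what R7RS mandates: the delimiter set is taken from section 7.1.1, while the names read_atom and classify and the simplified decimal number syntax are assumptions.

```python
import re

# Delimiters listed in R7RS section 7.1.1: whitespace, |, (, ), " and ;
DELIMITERS = set(' \t\n\r|()";')

# Simplified decimal syntax (an assumption); real Scheme numbers also
# cover radix prefixes, rationals, exponents and so on.
NUMBER = re.compile(r'[+-]?(\d+(\.\d*)?|\.\d+)')

def read_atom(text, start=0):
    """Read characters from `start` up to the next delimiter.

    Returns the atom and the index where reading stopped.
    """
    end = start
    while end < len(text) and text[end] not in DELIMITERS:
        end += 1
    return text[start:end], end

def classify(atom):
    """An atom that does not parse as a number is treated as a symbol."""
    return "number" if NUMBER.fullmatch(atom) else "symbol"
```

With a reader like this, a#a, b,b, c'c and d[d] are each read as a single atom and classified as symbols, because none of #, the comma, ' or the brackets is in the delimiter set. A reader that follows the 7.1.1 grammar strictly could reject them instead.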
I am not sure why the standard tokenizer (used by the default standard analyzer) behaves like this in this scenario:
- If I use the word system.exe it generates the token system.exe. I understand . is not a word breaker.
- If I use the word system32.exe it generates the tokens system and exe. I don't understand why it breaks the word when it finds a number followed by a dot (.).
- If I use the word system32tm.exe it generates the token system32tm.exe. As in the first example, it works as expected, not breaking the word into different tokens.
I have read http://unicode.org/reports/tr29/#Word_Boundaries but I still don't understand why a number + dot (.) is a word boundary.
As mentioned in the question, the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The rule http://unicode.org/reports/tr29/#Word_Boundaries is to not break if you have letter + dot + letter, see WB6 in the above spec. So tm.exe is preserved and system32.exe is split.
The spec says that it always splits, except for the listed exceptions. Exceptions WB6 and WB7 say that it never splits on letter, then punctuation, then letter. Rules WB11 and WB12 say that it never splits on number, then punctuation, then number. However, there is no such rule for number, then punctuation, then letter, so the default rule applies and system32.exe gets split.
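The interaction of these rules can be sketched in Python. This is a heavily simplified model, not the tokenizer's actual implementation: it knows only three character classes (letters, digits, and the dot as MidNumLet) and only rules WB5 to WB12, and the function names are made up for illustration.

```python
def kind(ch):
    """Map a character to a (very reduced) UAX #29 word-break class."""
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    if ch == ".":
        return "MidNumLet"
    return "Other"

def no_break(text, i):
    """True if the simplified rules forbid a break between text[i-1] and text[i]."""
    prev, cur = kind(text[i - 1]), kind(text[i])
    nxt = kind(text[i + 1]) if i + 1 < len(text) else None
    prev2 = kind(text[i - 2]) if i >= 2 else None
    alnum = ("ALetter", "Numeric")
    if prev in alnum and cur in alnum:  # WB5, WB8, WB9, WB10
        return True
    if prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter":  # WB6
        return True
    if prev2 == "ALetter" and prev == "MidNumLet" and cur == "ALetter":  # WB7
        return True
    if prev2 == "Numeric" and prev == "MidNumLet" and cur == "Numeric":  # WB11
        return True
    if prev == "Numeric" and cur == "MidNumLet" and nxt == "Numeric":  # WB12
        return True
    return False  # default rule: break everywhere else

def tokens(text):
    """Split at every allowed break, then keep segments containing a letter or digit."""
    segments, start = [], 0
    for i in range(1, len(text)):
        if not no_break(text, i):
            segments.append(text[start:i])
            start = i
    segments.append(text[start:])
    return [s for s in segments if any(c.isalnum() for c in s)]
```

In this model system.exe and system32tm.exe stay whole because the dot has a letter on both sides, while system32.exe breaks at the dot because a digit on the left and a letter on the right matches neither WB6/WB7 nor WB11/WB12.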
The regex below:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
is what I initially used to validate email format. After finding that the format "name@email...com" was passing my tests, I copied and pasted a different regex that limits the number of periods. It looks like this:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\z/i
The main difference is the piece of regex below:
(?:\.[a-z\d\-]+)
I can't quite figure out how this bit works. Can someone break it down for me?
Notice that in this subexpression:
(?:\.[a-z\d\-]+)
The character class [a-z\d\-] does not contain a period. The expression requires at least one (+) of those characters after the period (\.) in order to match. Therefore, a series of periods with no letters, digits or hyphens between them won't match the repetition of the subexpression.
The problem with your regular expression here is that you're allowing for multiple dots:
/[a-z\.]+\.[a-z]+\z/
To fix this you need to make your repeating pattern more specific in terms of structure:
/(?:[a-z]+\.)+[a-z]+\z/
That means you can have one or more repeating groups of letters plus dot. That will exclude multiple dots in a row.
Do keep in mind that email addresses are getting increasingly insane with the introduction of new gTLDs that are often used without any sort of prefix. That is, example@google may be a valid address in the future. You can't expect there to be a dot in the domain.
You have [a-z\d\-]+(?:\.[a-z\d\-]+)*. The [a-z\d\-]+ part ensures that this part of the string starts with a sequence of at least one non-period character. Only one period is allowed per (?:\.[a-z\d\-]+) structure. In each (?:\.[a-z\d\-]+), the period \. is necessarily followed by [a-z\d\-]+, which includes at least one non-period character. This ensures that whenever a period appears, it has at least one non-period character on the left and on the right. In other words, consecutive periods are not allowed.
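To see the difference in practice, here is a small sketch that tests only the domain part of both patterns, ported to Python's re module (where \Z plays the role of Ruby's \z). The names DOMAIN_OLD and DOMAIN_NEW are made up for this example.

```python
import re

# Domain part of the first regex: the period is inside the character class,
# so any run of periods is accepted.
DOMAIN_OLD = re.compile(r'\A[a-z\d\-.]+\.[a-z]+\Z', re.IGNORECASE)

# Domain part of the second regex: each period must be followed by at least
# one letter, digit or hyphen.
DOMAIN_NEW = re.compile(r'\A[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\Z', re.IGNORECASE)

def accepts(pattern, domain):
    """Return True if the whole domain string matches the pattern."""
    return pattern.match(domain) is not None
```

Here accepts(DOMAIN_OLD, "email...com") is True while accepts(DOMAIN_NEW, "email...com") is False, and both patterns accept "email.com" and "mail.example.com".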
All lowercase and uppercase, all digits, dot and slash.
Have I missed anything?
This seems like a very easy question to find an answer to on Google, but I haven't actually found any information about it :(
Edit, in case anybody misunderstood: what characters can the OUTPUT have?
I'm not asking what kind of stuff I can hash, I'm asking what the hash looks like.
DES (and many other encryption algorithms) works at the bit level: it has no concept of what is a valid character and what isn't, so each byte of the output can be anything from 0x00 to 0xFF.
Any output to the contrary is likely just characters not supported by whatever you're trying to display the output with, which are typically replaced by some predefined character.
The output can also be converted to hex characters for cosmetic or storage purposes (I'm not sure whether the des command would do this - it's simple enough to see by just running it), e.g. a single 'a' (0x61) character will be converted to two characters: '61'. The resulting output characters would thus be in the range A-F or a-f and 0-9.
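As a small illustration of the hex conversion (not of DES itself), the Python sketch below encodes some arbitrary bytes. The sample bytes are made up and stand in for whatever a cipher might produce.

```python
# Arbitrary sample bytes standing in for raw cipher output (an assumption);
# any value from 0x00 to 0xFF may occur.
raw = bytes([0x00, 0x61, 0x9C, 0xFF])

# Each byte becomes two hex digits, so the text uses only 0-9 and a-f.
hex_text = raw.hex()

# The single character 'a' (0x61) becomes the two characters '61'.
single = b"a".hex()
```

bytes.hex() produces lowercase digits; calling .upper() on the result gives the A-F variant.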
Note that keys require ASCII, but this is not a requirement of DES itself, as can be derived from "Bugs" on the same page, and it doesn't affect the range of output values.
The DES algorithm is considered obsolete and unsafe. The DES standard (FIPS 46-3) was withdrawn in 2005.
Use at your own risk.
See http://en.wikipedia.org/wiki/Data_Encryption_Standard
When I get data from some websites, the data is sometimes encoded in UTF-8 but looks like this:
Thỏ , Nạt
The accent mark is separated from the character, when in fact these strings should be:
Thỏ, Nạt
I don't know what the problem is here or how to correct it. Can someone help me with this?
The first sample string contains two Vietnamese characters in decomposed form. The first one of them is “ỏ”, consisting of simple letter “o” followed by U+0309 COMBINING HOOK ABOVE.
The second sample string has those characters in precomposed form. The first one of them is “ỏ” U+1ECF LATIN SMALL LETTER O WITH HOOK ABOVE.
The decomposed and precomposed forms are defined to be “canonical equivalent” and are normally expected to result in the same rendering (though this does not always happen). They are not identical, however; in programmatic comparison of characters and strings, they are very much different.
Mostly Latin letters with diacritics, such as “é” and “ä”, are used in precomposed form only, since that’s what keyboard drivers, online keyboards, character picking utilities, etc., normally produce. However, Vietnamese keyboard drivers often work so that some diacritic marks are entered after entering a base character, and the diacritic is thus produced as a combining character, i.e. the letter (like “ỏ”) is then in decomposed form.
One way of dealing with this issue, recommended in many contexts, is to convert your strings to Normalization Form C (NFC). This would put these characters into precomposed form. Note, however, that conversion to NFC removes some other distinctions, too (but this is not relevant if the text is in Vietnamese only and does not contain special symbols).
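A minimal sketch of the NFC conversion, assuming Python and its standard unicodedata module. The code points are written as escapes so that the decomposed and precomposed forms stay distinguishable in the source.

```python
import unicodedata

# Decomposed: "o" + U+0309 COMBINING HOOK ABOVE, "a" + U+0323 COMBINING DOT BELOW
decomposed = "Tho\u0309, Na\u0323t"

# Precomposed: U+1ECF (o with hook above), U+1EA1 (a with dot below)
precomposed = "Th\u1ecf, N\u1ea1t"

# The two strings usually render the same but compare as different...
differ_before = decomposed != precomposed

# ...until both are brought to Normalization Form C.
normalized = unicodedata.normalize("NFC", decomposed)
equal_after = normalized == precomposed
```

Normalizing both sides before comparing or storing them is the usual pattern; unicodedata.normalize("NFD", ...) goes in the opposite direction.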
It remains a mystery why the first sample string has a space character before the comma.