What are the valid characters in the domain part of e-mail address? - validation

Intention
I'm trying to do some very minimal validation of e-mail addresses, despite seeing a lot of advice against doing that. The reason is that the spec I am implementing requires e-mail addresses to be in this format:
mailto:<uri-encoded local part>@<domain part>
I'd like to simply split on the starting mailto: and the final @, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.
I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.
Problem
From everything I've read, splitting on the @ seems risky to me.
I've seen a lot of conflicting advice on the web and in Stack Overflow answers, most of it saying "read the RFCs", and some of it saying that the domain part can only contain certain characters, i.e. 0-9 a-z A-Z -., maybe a couple of other characters, but not much more than this. E.g.:
What characters are allowed in an email address?
When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) are allowed, which implies @ symbols are allowed. This is further compounded because "comments" are allowed in parens ( ) and can contain characters between ASCII 42 and 91, which include @.
RFC 1035 seems to support the letters+digits+dashes+periods requirement, but the "domain literal" syntax in RFC 5322 seems to allow more characters.
Am I misunderstanding the RFC, or is there something I'm missing that disallows an @ in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?

The most recent RFC for email on the internet is RFC 5322 and it specifically addresses addresses.
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string / obs-local-part
The dot-atom is a highly restricted set of characters defined in the spec. However, the quoted-string is where you can run into trouble. It's not often used, but if you do run into it, you could well get something in quotation marks that itself contains an @ character.
However, if you split the string at the last @, you should have safely located the local-part and the domain, which is well defined in the specification in terms of how you can verify it.
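As a rough illustration of the split described above, here is a minimal Python sketch. The function name split_mailto, the SEPARATOR constant, and the choice to accept only RFC 3986 unreserved characters plus %XX escapes in the local part are assumptions made for this example, not something the question's spec is known to require.

```python
import re
from urllib.parse import unquote

# Separator between local part and domain in an addr-spec.
SEPARATOR = "@"

# Assumption: "URI encoded" means only unreserved characters and %XX escapes.
_URI_ENCODED = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+")


def split_mailto(value: str) -> tuple[str, str]:
    """Split 'mailto:<encoded local part><SEPARATOR><domain>' at the LAST separator."""
    prefix = "mailto:"
    if not value.startswith(prefix):
        raise ValueError("missing mailto: prefix")
    # rpartition splits at the final occurrence of the separator.
    local, sep, domain = value[len(prefix):].rpartition(SEPARATOR)
    if not sep or not local or not domain:
        raise ValueError("expected a local part and a domain")
    if not _URI_ENCODED.fullmatch(local):
        raise ValueError("local part is not URI encoded")
    # Return the decoded local part and the raw domain.
    return unquote(local), domain
```

Because the local part must be URI encoded, any separator character inside it arrives as an escape, so splitting at the last occurrence only has to worry about what the domain can contain.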
The problem comes with Punycode, whereby almost any Unicode character can be mapped into a valid DNS name. If the system you are front-ending can understand and interpret Punycode, then you have to handle almost anything that contains valid Unicode characters. If you know you're not going to work with Punycode, then you can use a more restricted set, generally letters, digits, and the hyphen character.
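One way to combine both cases is to convert the domain to its ASCII form first and then apply the restricted letters/digits/hyphen check to the result. The sketch below uses Python's built-in "idna" codec (which implements the older IDNA 2003 rules); the function name to_ascii_domain and the 253-character overall limit are choices made for this example.

```python
import re

# One DNS label: letters, digits, hyphens; 1-63 chars; no leading/trailing hyphen.
_LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def to_ascii_domain(domain: str) -> str:
    """Convert a possibly internationalized domain to ASCII and check each label."""
    # The built-in codec maps non-ASCII labels to their xn-- (Punycode) form.
    # It raises UnicodeError (a ValueError subclass) on empty or overlong labels.
    ascii_domain = domain.encode("idna").decode("ascii")
    labels = ascii_domain.split(".")
    if len(ascii_domain) > 253 or not all(_LDH_LABEL.fullmatch(label) for label in labels):
        raise ValueError(f"not a letters/digits/hyphen domain: {domain!r}")
    return ascii_domain
```

Note that this deliberately rejects domain literals such as [192.0.2.1]; whether that is acceptable depends on how "best effort" the validation is allowed to be.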
To quote the late, great Jon Postel:
TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Side note on the local part:
Keep in mind, of course, that there are probably lots of systems on the internet that don't require strict adherence to the specs and may therefore allow things outside the spec to work, owing to the long-standing liberal-acceptance/conservative-transmission philosophy.

Related

Does a Punycode domain name (UName) store the IDN table used?

I've created a domain name such as: même.vip
I can see in the database, that the domain name has been registered with IDN table: "fr".
However, 'ê' can be Portuguese, Norwegian, etc...
I am trying to understand who is assuming the IDN table here...
I can see the EPP transaction - it is not using the IDN extension and therefore cannot supply an IDN table to the server, even if it wanted to
I cannot access the code that populated that DB record
Therefore, my best chance is to know if the Punycode domain name contains information on which table was used. If not: then I know it's the DB or some service at the registry, after the EPP command.
(Of course, if the punycode DOES contain the IDN table, then I have more digging to do!)
Does a Punycode domain name (UName) store the IDN table used?
TL;DR: No.
You are mixing multiple things, but it is difficult to summarize everything (I did a very detailed answer at https://webmasters.stackexchange.com/a/122160/75842 which should help you).
For computers, ê being either Portuguese or Norwegian makes no difference at the DNS level. In the same way, at the Unicode level, ê is "U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX", which is just defined as a "Latin" character, irrespective of which language might use it.
In short:
The IETF invented the Punycode algorithm, and more precisely the IDNA standard, just to make sure that people could use (almost) any character in their domain name. As such, the algorithm is just a translation from "any Unicode string" to "an ASCII string starting with xn--".
The domain name industry, with ICANN and all registries, then decides on rules on top of that. For example, there is a major rule, "you cannot mix characters from multiple scripts in the same string", mostly to avoid IDN homograph attacks (so not really a technical constraint); my answer linked above goes into full detail on this.
At the EPP level, various actors created various extensions; there is no real standardized "IDN" specification here. That is also why you will find some people speaking about "scripts", others about "languages", others about "repertoire", etc. It is a mess (Unicode only speaks about scripts, not languages). Some registries do not use any extension, while others do. Some want you to always pass an IDN "table" (aka script/language/whatever) reference, some will require it only in some cases. For example, look at the Verisign IDN practices at https://www.verisign.com/en_US/channel-resources/domain-registry-products/idn/idn-policy/registration-rules/index.xhtml; it boils down to "all IDN registrations need a language tag; some of them are attached to a specific list of possible characters".
In theory all, but in practice only most, existing IDN tables can be found at https://www.iana.org/domains/idn-tables. You can see that they are per registry, which shows that this extra information is really not encoded in the ASCII form of the domain name after conversion by the Punycode algorithm.
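To see concretely that the ASCII form carries nothing but code points, you can decode a label and list what comes out. This is a small Python sketch using the standard library's "idna" codec and unicodedata; the helper name describe_label is made up for this example, and the label used is the ASCII form of the "même" label from the question.

```python
import unicodedata


def describe_label(a_label: str) -> list[str]:
    """Decode an xn-- label and describe each resulting Unicode code point."""
    # Decoding yields a plain Unicode string: code points only,
    # with no language or IDN table attached.
    u_label = a_label.encode("ascii").decode("idna")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in u_label]


for line in describe_label("xn--mme-fma"):
    print(line)
```

The second entry is U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX, with no hint of "fr", "pt" or any other table; that association has to come from the registry side.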
I am trying to understand who is assuming the IDN table here...
There should be no assumption (either it is given by registrar or not given) or there is no IDN table needed (the registry will just do the Punycode conversion in reverse and decide, based on characters found, which table it should be in).
I can see the EPP transaction - it is not using the IDN extension and therefore cannot supply an IDN table to the server, even if it wanted to
Which registry? If you are a registrar, in practice the registry should be able to help you and answer this kind of question. Note that most of the time (I could write "all the time", but I am not sure that no counterexample exists; I just have none in mind right now), during EPP domain:check you just pass the name (in ASCII form) without any IDN extension, while you pass the IDN extension, if any, during the domain:create. This also means that the domain:check might not give you the full, proper reply, simply because at that point not everything is known.
See these EPP documents on IDN extensions:
https://datatracker.ietf.org/doc/html/draft-ietf-eppext-idnmap-02
https://datatracker.ietf.org/doc/html/draft-wilcox-cira-idn-eppext
https://tools.ietf.org/id/draft-gould-idn-table-07.html
https://datatracker.ietf.org/doc/html/draft-sienkiewicz-epp-idn-00

WebApi Url should allow special characters like forward slash(/) , ( and )

Some of my Web API endpoints take strings as input parameters, and sometimes I have to pass special characters like /, \, ( and ). However, these characters are not allowed because they have special meanings. Is there any way to solve this problem?
Is there any way to solve this problem?
No.
(Please) Stop Using Unsafe Characters in URLs. URLs are by definition machine readable. There are rules to follow about what can and cannot be part of a URL.
Although using unsafe characters is technically supposed to be possible through encoding, all bets are off as to whether all browsers, web servers, and firewalls will treat them as "plain text" when encoded instead of assigning them special meaning. Some may just reject the URL entirely, considering it spam or attempted hacking.
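For reference, this is what the encoding mentioned above looks like. A minimal Python sketch using urllib.parse.quote with safe="" so that / is escaped too; the helper name encode_path_segment is invented for this example, and as the answer warns, there is no guarantee that every server or proxy in between will accept escapes such as %2F in a path.

```python
from urllib.parse import quote, unquote


def encode_path_segment(value: str) -> str:
    """Percent-encode a value so it can be placed in a single URL path segment."""
    # safe="" means "/" is escaped as well; by default quote() leaves it alone.
    return quote(value, safe="")


encoded = encode_path_segment("a/b\\c(d)")
print(encoded)           # a%2Fb%5Cc%28d%29
print(unquote(encoded))  # a/b\c(d)
```

If the endpoint is under your control, passing such values in the query string or the request body tends to avoid the problem altogether.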

Check if a regex is a subset of another or equal

I have a page where a user can add an IP address to a whitelist; its format is checked to confirm that it is a valid IP.
I'd like to add functionality so that regexes can also be input. I would like to verify that the regex matches a valid IP address (i.e. the regex entered by the user is a subset of the regex that is specified in the code).
IP_Regex: ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$
Example: A user must input a string matching the specifications of IP_Regex (such as 10.111.111.111) or a subset of it (such as 12(?>\.\d{1,3}){3})
I'm not sure how to go about this. Most posts seem to just cite math theory but don't mention how to go about this when programming.
I don't think it is dangerous to allow your users to input regexes, so you don't have to be 100% accurate.
Therefore I would randomly generate some slightly invalid IPs and make sure the regexes fail on those.
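A rough Python sketch of that idea follows. It is a heuristic, not a proof of the subset relation: it only checks that the user's regex rejects a batch of near-miss strings. The function names, the four mutation kinds, and the sample count are all arbitrary choices for this example. The example pattern from the question uses an atomic group (?>...), which Python's re module only accepts from version 3.11, so the usage below uses a plain non-capturing group instead.

```python
import random
import re

IP_REGEX = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")


def slightly_invalid_ips(rng: random.Random, count: int = 200) -> list[str]:
    """Generate strings that are close to, but do not match, IP_REGEX."""
    samples = []
    while len(samples) < count:
        octets = [str(rng.randint(0, 255)) for _ in range(4)]
        mutation = rng.choice(["drop", "extra", "letter", "comma"])
        if mutation == "drop":
            candidate = ".".join(octets[:3])      # only three octets
        elif mutation == "extra":
            candidate = ".".join(octets + ["1"])  # five octets
        elif mutation == "letter":
            candidate = ".".join(octets) + "x"    # trailing letter
        else:
            candidate = ",".join(octets)          # wrong separator
        # Keep only candidates the reference regex really rejects.
        if not IP_REGEX.fullmatch(candidate):
            samples.append(candidate)
    return samples


def looks_like_ip_subset(user_pattern: str, seed: int = 0) -> bool:
    """Return False if the user's regex accepts any of the near-miss samples."""
    user_regex = re.compile(user_pattern)  # raises re.error on a bad pattern
    rng = random.Random(seed)
    return not any(user_regex.fullmatch(s) for s in slightly_invalid_ips(rng))


print(looks_like_ip_subset(r"12(?:\.\d{1,3}){3}"))  # True
print(looks_like_ip_subset(r".*"))                  # False
```

A regex can pass this check and still accept strings that are not IPs, so it only catches the obviously too-broad patterns.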

Should I capture email addresses in lowercase in an uppercase system?

Your opinions please.
In our system the decision was made that all fields (except notes) will be forced into uppercase. I don't like it, but the system's intended users aren't very computer literate, so for line-of-business use I suppose it is acceptable.
When it comes to email addresses however, would you also force them into uppercase, against the lowercase convention?
The part before the @, the "local part", is "locally interpreted" (see RFC 2822 §3.4.1).
This could include being case-sensitive. Whether any email processing agents support case sensitivity is another question.
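If you want to play it safe, one option is to leave the local part exactly as typed and normalize only the domain, since DNS names are compared case-insensitively. A minimal Python sketch, where the function name normalize_email and the decision to lowercase the domain are assumptions for this example rather than anything the RFC mandates:

```python
def normalize_email(address: str) -> str:
    """Lowercase the domain but keep the local part exactly as entered."""
    # Split at the last separator; everything before it is the local part.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected a local part and a domain")
    return f"{local}@{domain.lower()}"


print(normalize_email("John.Smith@EXAMPLE.COM"))  # John.Smith@example.com
```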
Email addresses are case-insensitive, so it doesn't matter. Upper case would be fine.

Are unescaped user names incompatible with BNF?

I've got (proprietary) output from a piece of software that I need to parse. Sadly, it contains unescaped user names, and I'm scratching my head trying to work out whether or not I can describe the files I need to parse using a BNF (or EBNF or ABNF).
The problem, oversimplified (it's really just an example), may look like this:
(data) ::= <username>
<username> ::= (other type of data)
And in some case, instead of appearing at the left or at the right, the username can also appear in the middle of a line.
The problem is that the username is unescaped and there are not enough restrictions on user names (they're printable ASCII, max 20 chars, and they can't contain line breaks). So "=" would be a perfectly valid username, for example. And so would "= 1 = john = 2" (because users, at sign-on, were allowed to choose any user name they wanted, and these appear unescaped in the output I've got).
I'm asking because my parser choked on some very creative usernames (once again, not in my control, they're "weird" and I need to deal with it) and I cannot find an easy way to deal with this. Also note that I do not know the user names in advance (for example, I don't have access to a database that would contain all the user names that the users created).
So are unrestricted and unescaped user names incompatible with BNF?
P.S.: be cool with me if I made mistakes, it's my first post on Stack Overflow :)
BNF doesn't "care" about user names per se. It works on the token level. If you define a username token, you can describe a grammar using BNF based on it.
Your problem should be solved on the lexer level. The lexer should be smart enough to recognize user names, even when they're not escaped, and pass username tokens to the parser.
In theory you could describe all kinds of user names with a grammar, but this heavily depends on the other things in your language. Is = a valid token in its own right? If it is, how can you tell it apart from a username that has = in it? I think you'll have to describe the rest of the rules and valid tokens in your language to get a fuller answer here.
It might be possible to work by recognising things that are not usernames and then declaring everything else a username, even if this means parsing from right to left instead of left to right or doing something equally eccentric.
It may be worth looking to see if your input is actually ambiguous: can you find two different situations that lead to identical output being generated? If so, you need to go back and get requirements for which of them to favour, or what sort of error to produce, or whatever. If not, the reason why not might help you work out what your parser or lexer or whatever needs to do.
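As an illustration of recognising the non-username part first, here is a small Python sketch. The line format is purely hypothetical (the question only gives an oversimplified example): it assumes lines of the form "<username> = <integer>", so the value on the right is easy to recognise and everything left of the last " = " is taken as the username.

```python
def parse_line(line: str) -> tuple[str, int]:
    """Parse '<username> = <integer>' by anchoring on the right-hand side."""
    # Split at the LAST ' = ', so any earlier ' = ' stays inside the username.
    username, sep, value = line.rpartition(" = ")
    if not sep:
        raise ValueError("no ' = ' separator found")
    if not (value.isascii() and value.isdigit()):
        raise ValueError("right-hand side is not an integer")
    if not 1 <= len(username) <= 20:
        raise ValueError("username must be 1 to 20 characters")
    return username, int(value)


print(parse_line("= 1 = john = 2 = 42"))  # ('= 1 = john = 2', 42)
```

This only works because, in the assumed format, the right-hand side can never be mistaken for part of a username; if both sides can contain arbitrary text, the input is genuinely ambiguous and no parser can recover the original split.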

Resources