Should I capture email addresses in lowercase in an uppercase system? - user-interface

Your opinions please.
In our system the decision was made that all fields (except notes) will be forced into uppercase. I don't like it, but the system's intended users aren't very computer literate, so for line-of-business use I suppose it is acceptable.
When it comes to email addresses however, would you also force them into uppercase, against the lowercase convention?

The part before # "Local Part" is "locally interpreted" (see RFC 2822 §3.4.1).
This could include being case-sensitive. Whether any email processing agents support case sensitivity is another question.

Email addresses are case-insensitive, so it doesn't matter. Upper case would be fine.

Related

Does a Punycode domain name (UName) store the IDN table used?

I've created a domain name such as: même.vip
I can see in the database, that the domain name has been registered with IDN table: "fr".
However, 'ê' can be Portuguese, Norwegian, etc...
I am trying to understand who is assuming the IDN table here...
I can see the EPP transaction - it is not using the IDN extension and therefore cannot supply an IDN table to the server, even if it wanted to
I cannot access the code that populated that DB record
Therefore, my best chance is to know if the Punycode domain name contains information on which table was used. If not: then I know it's the DB or some service at the registry, after the EPP command.
(Of course, if the punycode DOES contain the IDN table, then I have more digging to do!)
Does a Punycode domain name (UName) store the IDN table used?
TL;DR: No.
You are mixing multiple things, but it is difficult to summarize everything (I did a very detailed answer at https://webmasters.stackexchange.com/a/122160/75842 which should help you).
For the computers, ê being either Portuguese or Norwegian does not make a difference at the DNS level. In the same way that at the Unicode level, ê is
"U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX" that is just defined as a "Latin" character, irrespective to which language might use it.
In short:
the IETF invented the Punycode algorithm, and more precisely the IDNA standard just to make sure that people could use (almost) any character in their domain name. As such the algorithm is just a translation from "any Unicode string" to "an ASCII string starting with xn--"
The domain name industry, with ICANN and all registries, then decide on rules on top of that. For example there is a major rule "you can not mix characters from multiple scripts in the same string", to avoid IDN homograph attacks mostly (so not really a technical constraint); my answer above gets in full details on this.
At the EPP level, various actors created various extensions, there is no real standardized "IDN" specification here. Which is also why you will find people speaking about "scripts", other about "languages", other about "repertoire", etc. It is a mess (Unicode only speaks about scripts, not languages). Some registries do not use any extension, while others do. Some want you to always pass an IDN "table" (aka script/language/whatever) reference, some will require it only in some cases. For example look at Verisign IDN practices at https://www.verisign.com/en_US/channel-resources/domain-registry-products/idn/idn-policy/registration-rules/index.xhtml; It boils down to "all IDN registrations need a language tag; some of them are attached to specific list of possible characters"
You can find in theory all but in practice only most of IDN tables existing at https://www.iana.org/domains/idn-tables and you can see they are per registry, showing that this extra information is really not encoded in the ASCII form of the domain name, after conversion by Punycode algorithm.
I am trying to understand who is assuming the IDN table here...
There should be no assumption (either it is given by registrar or not given) or there is no IDN table needed (the registry will just do the Punycode conversion in reverse and decide, based on characters found, which table it should be in).
I can see the EPP transaction - it is not using the IDN extension and therefore cannot supply an IDN table to the server, even if it wanted to
Which registry? If you are a registrar, in practice the registry should be able to help you and answer this kind of questions. Note that most of the time (I could write "all the time", but I am not sure no counter example exists or at least I have none in mind right now), during EPP domain:check you just pass the name (in ASCII form) without any IDN extension, while you pass the IDN extension, if any, during the domain:create. Which also means that the domain:check might not get you the proper full reply, just because at that point not everything is known.
See these EPP documents on IDN extensions:
https://datatracker.ietf.org/doc/html/draft-ietf-eppext-idnmap-02
https://datatracker.ietf.org/doc/html/draft-wilcox-cira-idn-eppext
https://tools.ietf.org/id/draft-gould-idn-table-07.html
https://datatracker.ietf.org/doc/html/draft-sienkiewicz-epp-idn-00

Check if a regex is a subset of another or equal

I have a page where a user can add an IP address to a whitelist, whose format is verified if it is a valid IP.
I'd like to add functionality so that regex's can also be input. I would like to verify that the regex matches a valid IP address (ie. the regex entered by the user is a subset of the regex that is specified in the code).
IP_Regex: ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$
Example: A user must input a string matching the specifications of IP_Regex (such as 10.111.111.111) or a subset of it (such as 12(?>\.\d{1,3}){3})
I'm not sure how to go about this. Most posts seem to just cite math theory but don't mention how to go about this when programming.
I don't think it is dangerous to allow your users to input regexes, so you don't have to be 100% accurate.
Therefore I would randomly generate some slightly invalid ips and make sure the regexes fail on those.

Extract relevant address from string?

I am developing an address matching application using Google geocoding API.
The problem is that some of the addresses in the database I am trying to validate are something like:
ATTN: Mr. THOMAS WONG 2457 Yonge St., Toronto, ON, N2S 2V5, Canada
rather than
2457 Yonge St., Toronto, ON, N2S 2V5, Canada
The first string returns null results (because it starts with a person's name), the second one will validate and return a full correct address.
My question is: What would be the right approach to this issue?
I am thinking of a way to extract only the relevant part from the address string (with some function) but maybe there are better ideas?
Thank you,
M.R.
I work at SmartyStreets and wrote the address extractor which we now offer with LiveAddress API. It's hard. There are a lot of assumptions you need to force yourself not to make, including "if the address starts with a number." (Sorry DwB -- there's a lot to consider.)
If you have US addresses, you may still find our tool useful (it's free to sign up and use, to a point). Here's another Stack Overflow post about the extraction utility: https://stackoverflow.com/a/16448034/1048862
The best way to do this would be to use an address validation service -- one that can validate delivery points and not just address ranges (which is most common, so be wary of claims to "address validation" when it's really just guessing within certain bounds).
Be aware, too, that Google does not validate addresses. It may standardize them, and will return results where the address would exist if it were real, and if it is actually valid, it's your lucky day.
If the desired part of the address always starts with a number, try this:
find the first digit in the string.
get a substring from the first digit to the end of the string.
you now have the address.
In order to parse addresses, you need to know all possible formats.
Do you need to include:
Santa, North Pole.
The Queen, Great Britian
Captian Hootberry
Bob Goldenberry, rural route 7, MN
Jackie Blam, P.O. Box 78, Hootville, OH
For a comprehensive address parsing solution, you will need to provide several algorithms for different address formats then determine which algorithm to use based on the input.

What are the valid characters in the domain part of e-mail address?

Intention
I'm trying to do some minimal very minimal validation of e-mail addresses, despite seeing a lot of advice advising against doing that. The reason I'm doing this is that spec I am implementing requires e-mail addresses to be in this format:
mailto:<uri-encoded local part>#<domain part>
I'd like to simply split on the starting mailto: and the final #, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.
I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.
Problem
From everything I've read, splitting on the # seems risky to me.
I've seen a lot of conflicting advice on the web and on Stack Overflow answers, most of it saying "read the RFCs", and some of it saying that the domain part can only be certain characters, i.e. 1-9 a-z A-Z -., maybe a couple other characters, but not much more than this. E.g.:
What characters are allowed in an email address?
When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) are allowed, which implies # symbols are allowed. This is further compounded because "comments" are allowed in parens ( ) and can contain characters between ASCII 42 and 91 which include #.
RFC1035 seems to support the letters+digits+dashes+periods requirement, but "domain literal" syntax in RFC5322 seems to allow more characters.
Am I misunderstanding the RFC, or is there something I'm missing that disallows a # in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?
The most recent RFC for email on the internet is RFC 5322 and it specifically addresses addresses.
addr-spec = local-part "#" domain
local-part = dot-atom / quoted-string / obs-local-part
The dot-atom is a highly restricted set of characters defined in the spec. However, the quoted-string is where you can run into trouble. It's not often used, but in terms of the possibility that you'll run into it, you could well get something in quotation marks that could itself contain an # character.
However, if you split the string from the last #, you should safely have located the local-part and the domain, which is well defined in the specification in terms of how you can verify it.
The problem comes with punycode, whereby almost any Unicode character can be mapped into a valid DNS name. If the system you are front-ending can understand and interpret punycode, then you have to handle almost anything that has valid unicode characters in it. If you know you're not going to work with punycode, then you can use a more restricted set, generally letters, digits, and the hyphen character.
To quote the late, great Jon Postel:
TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Side note on the local part:
Keeping in mind, of course, that there are probably lots of systems on the internet that don't require strict adherence to the specs and therefore might allow things outside of the spec to work due to the long standing liberal-acceptance/conservative-transmission philosophy.

Validating FirstName in a web application

I do not want to be too strict as there may be thousands of possible characters in a possible first name
Normal english alphabets, accented letters, non english letters, numbers(??), common punctuation synbols
e.g.
D'souza
D'Anza
M.D. Shah (dots and space)
Al-Rashid
Jatin "Tom" Shah
However, I do not want to except HTML tags, semicolons etc
Is there a list of such characters which is absolutely bad from a web application perspective
I can then use RegEx to blacklist these characters
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~##$%;]*";
if(!fname.matches(regex))
throw new InputValidationException("Invalid FirstName")
My question is more on the design than coding ... I am looking for a exhaustive (well to a good degree of exhaustiveness) list of characters that I should blacklist
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it might be highly unlikely that it's a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semi-colons, etc. only make sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you gong to create filenames including the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.

Resources