Check if a regex is a subset of another or equal - ruby

I have a page where a user can add an IP address to a whitelist; the input is checked to make sure it is a valid IP.
I'd like to add functionality so that regexes can also be entered. I would like to verify that the regex only matches valid IP addresses (i.e. everything matched by the regex entered by the user is also matched by the regex specified in the code).
IP_Regex: ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$
Example: A user must input a string matching the specifications of IP_Regex (such as 10.111.111.111) or a subset of it (such as 12(?>\.\d{1,3}){3})
I'm not sure how to go about this. Most posts seem to just cite math theory but don't mention how to go about this when programming.

I don't think it is dangerous to allow your users to input regexes, so you don't have to be 100% accurate.
Therefore I would randomly generate some slightly invalid IPs and make sure the regexes fail on those.
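A minimal sketch of that idea, written in Python rather than Ruby purely for illustration. The helper names (`mutate`, `looks_like_subset`), the four mutation kinds, and the trial count are assumptions, not anything from the question. It is a heuristic: passing it does not prove the user's regex is a subset, it only fails to find a counterexample. The atomic group from the question is written as a plain non-capturing group here so the example also runs on Python versions before 3.11.

```python
import random
import re

# The reference pattern from the question (anchors dropped because fullmatch is used).
IP_REGEX = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")


def mutate(ip, rng):
    """Return a slightly broken variant of a dotted-quad string."""
    parts = ip.split(".")
    choice = rng.randrange(4)
    if choice == 0:
        parts.pop()                                  # too few octets
    elif choice == 1:
        parts.append("1")                            # too many octets
    elif choice == 2:
        parts[rng.randrange(len(parts))] = "1234"    # octet with too many digits
    else:
        parts[rng.randrange(len(parts))] = "x"       # non-digit octet
    return ".".join(parts)


def looks_like_subset(candidate, trials=200, seed=0):
    """Heuristic: reject the candidate if it accepts any string IP_REGEX rejects."""
    rng = random.Random(seed)
    pattern = re.compile(candidate)
    for _ in range(trials):
        ip = ".".join(str(rng.randrange(256)) for _ in range(4))
        bad = mutate(ip, rng)
        # Only count strings the reference pattern really rejects.
        if IP_REGEX.fullmatch(bad) is None and pattern.fullmatch(bad):
            return False
    return True
```

A candidate such as `12(?:\.\d{1,3}){3}` passes this check, while something permissive like `.*` is rejected as soon as one mutated string gets through.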

Extract relevant address from string?

I am developing an address matching application using Google geocoding API.
The problem is that some of the addresses in the database I am trying to validate are something like:
ATTN: Mr. THOMAS WONG 2457 Yonge St., Toronto, ON, N2S 2V5, Canada
rather than
2457 Yonge St., Toronto, ON, N2S 2V5, Canada
The first string returns null results (because it starts with a person's name), the second one will validate and return a full correct address.
My question is: What would be the right approach to this issue?
I am thinking of a way to extract only the relevant part from the address string (with some function) but maybe there are better ideas?
Thank you,
M.R.
I work at SmartyStreets and wrote the address extractor which we now offer with LiveAddress API. It's hard. There are a lot of assumptions you need to force yourself not to make, including "if the address starts with a number." (Sorry DwB -- there's a lot to consider.)
If you have US addresses, you may still find our tool useful (it's free to sign up and use, to a point). Here's another Stack Overflow post about the extraction utility: https://stackoverflow.com/a/16448034/1048862
The best way to do this would be to use an address validation service -- one that can validate delivery points and not just address ranges (which is most common, so be wary of claims to "address validation" when it's really just guessing within certain bounds).
Be aware, too, that Google does not validate addresses. It may standardize them, and will return results where the address would exist if it were real, and if it is actually valid, it's your lucky day.
If the desired part of the address always starts with a number, try this:
find the first digit in the string.
get a substring from the first digit to the end of the string.
you now have the address.
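The three steps above can be sketched as follows. The function name `strip_leading_name` is hypothetical, and the sketch inherits the assumption that the address always starts with a number, which the other answer warns against relying on.

```python
import re


def strip_leading_name(raw):
    """Drop everything before the first digit; return the input unchanged if there is none."""
    match = re.search(r"\d", raw)
    if match is None:
        return raw
    return raw[match.start():]
```

Applied to the example from the question, this removes the "ATTN: Mr. THOMAS WONG" prefix and leaves the part starting at "2457".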
In order to parse addresses, you need to know all possible formats.
Do you need to include:
Santa, North Pole.
The Queen, Great Britain
Captain Hootberry
Bob Goldenberry, rural route 7, MN
Jackie Blam, P.O. Box 78, Hootville, OH
For a comprehensive address parsing solution, you will need to provide several algorithms for different address formats then determine which algorithm to use based on the input.

What are the valid characters in the domain part of e-mail address?

Intention
I'm trying to do some very minimal validation of e-mail addresses, despite seeing a lot of advice against doing that. The reason I'm doing this is that the spec I am implementing requires e-mail addresses to be in this format:
mailto:<uri-encoded local part>@<domain part>
I'd like to simply split on the starting mailto: and the final @, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.
I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.
Problem
From everything I've read, splitting on the @ seems risky to me.
I've seen a lot of conflicting advice on the web and on Stack Overflow answers, most of it saying "read the RFCs", and some of it saying that the domain part can only be certain characters, i.e. 1-9 a-z A-Z -., maybe a couple other characters, but not much more than this. E.g.:
What characters are allowed in an email address?
When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) are allowed, which implies @ symbols are allowed. This is further compounded because "comments" are allowed in parens ( ) and can contain characters between ASCII 42 and 91, which include @.
RFC1035 seems to support the letters+digits+dashes+periods requirement, but "domain literal" syntax in RFC5322 seems to allow more characters.
Am I misunderstanding the RFC, or is there something I'm missing that disallows a @ in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?
The most recent RFC for email on the internet is RFC 5322 and it specifically addresses addresses.
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string / obs-local-part
The dot-atom is a highly restricted set of characters defined in the spec. However, the quoted-string is where you can run into trouble. It's not often used, but in terms of the possibility that you'll run into it, you could well get something in quotation marks that could itself contain an @ character.
However, if you split the string at the last @, you should safely have located the local-part and the domain, which is well defined in the specification in terms of how you can verify it.
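A minimal sketch of splitting at the last `@` (the addr-spec separator in RFC 5322), in Python for illustration. The function name `split_mailto` is hypothetical. Reading "URI encoded" as "only unreserved characters and `%XX` escapes" is an assumption about the asker's spec, not something the thread states.

```python
import re

# Assumption: "URI encoded" means unreserved characters plus percent-escapes only.
URI_ENCODED = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+")


def split_mailto(value):
    """Split 'mailto:<local>@<domain>' at the LAST '@' and return (local, domain)."""
    prefix = "mailto:"
    if not value.startswith(prefix):
        raise ValueError("missing mailto: prefix")
    local, sep, domain = value[len(prefix):].rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected <local part>@<domain part>")
    if URI_ENCODED.fullmatch(local) is None:
        raise ValueError("local part is not URI encoded")
    return local, domain
```

Because a quoted local part may itself contain `@`, a literal `@` there has to arrive encoded as `%40`; `rpartition` then guarantees that the remaining separator is the last `@` in the string.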
The problem comes with punycode, whereby almost any Unicode character can be mapped into a valid DNS name. If the system you are front-ending can understand and interpret punycode, then you have to handle almost anything that has valid unicode characters in it. If you know you're not going to work with punycode, then you can use a more restricted set, generally letters, digits, and the hyphen character.
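As a rough illustration of the restricted set, the sketch below accepts only dot-separated labels made of letters, digits, and hyphens. The function name `is_ldh_domain` is hypothetical. The 63-character label limit and the no-leading-or-trailing-hyphen rule follow the usual host name conventions, and the 253-character overall cap is the commonly used practical limit. Punycode-encoded labels (the `xn--` form) are themselves plain letters, digits, and hyphens, so they pass this check unchanged.

```python
import re

# One label: 1-63 characters, letters/digits/hyphens, no hyphen at either end.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_domain(domain):
    """Return True if every dot-separated label uses only letters, digits and hyphens."""
    if not domain or len(domain) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in domain.split("."))
```

This deliberately rejects domain literals such as `[192.0.2.1]`; whether those need to be supported depends on the systems behind the validation.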
To quote the late, great Jon Postel:
TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Side note on the local part:
Keeping in mind, of course, that there are probably lots of systems on the internet that don't require strict adherence to the specs and therefore might allow things outside of the spec to work due to the long standing liberal-acceptance/conservative-transmission philosophy.

Validating FirstName in a web application

I do not want to be too strict, as there may be thousands of possible characters in a first name:
normal English letters, accented letters, non-English letters, numbers(??), common punctuation symbols
e.g.
D'souza
D'Anza
M.D. Shah (dots and space)
Al-Rashid
Jatin "Tom" Shah
However, I do not want to accept HTML tags, semicolons, etc.
Is there a list of such characters which are absolutely bad from a web application perspective?
I can then use a regex to blacklist these characters.
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~@#$%;]*";
if (!fname.matches(regex)) {
    throw new InputValidationException("Invalid FirstName");
}
My question is more about design than coding ... I am looking for an exhaustive (well, to a good degree of exhaustiveness) list of characters that I should blacklist.
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it might be highly unlikely that it's a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semicolons, etc. only makes sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you going to create filenames including the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.
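A minimal sketch of escaping per output context, in Python for illustration (in a Servlet/JSP app the same roles would be played by the platform's own HTML escaping and by prepared statements). The function names and the in-memory SQLite table are assumptions made for the example. The name is stored exactly as entered; escaping happens only when HTML is generated, and the SQL layer uses a bound parameter instead of string concatenation.

```python
import html
import sqlite3


def render_greeting(first_name):
    """Escape at the point of HTML output, not at input time."""
    return "<p>Hello, " + html.escape(first_name) + "</p>"


def store_and_load(first_name):
    """Round-trip a name through SQL using a bound parameter (no blacklist needed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (first_name TEXT)")
    conn.execute("INSERT INTO users (first_name) VALUES (?)", (first_name,))
    row = conn.execute("SELECT first_name FROM users").fetchone()
    conn.close()
    return row[0]
```

With this split, a name like `Jatin "Tom" Shah` or `D'souza` is stored unchanged, and even a name containing markup is rendered harmlessly as text.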

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matters more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi-structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably best be found by searching for items that are grammatically significant. For example, English capitalizes only the first word of a sentence and proper nouns. Using that grammatical rule, you could look for all words whose first letter is capitalized and check each against a database that contains the word and its type [i.e. Bob - Name, Elon - Place, England - Place].
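A minimal sketch of the capitalized-word lookup. The `GAZETTEER` dictionary stands in for the database mentioned above and contains only the three example entries; the function name `tag_proper_nouns` is hypothetical. Sentence-initial words are capitalized too, so they are simply looked up like any other word and dropped when the lookup misses.

```python
import re

# Stand-in for the word/type database; entries taken from the example above.
GAZETTEER = {"Bob": "Name", "Elon": "Place", "England": "Place"}


def tag_proper_nouns(text):
    """Return (word, type) pairs for capitalized words found in the gazetteer."""
    hits = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        kind = GAZETTEER.get(match.group())
        if kind is not None:
            hits.append((match.group(), kind))
    return hits
```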
Secondly: information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have specific formats that change little. Use a regex to find candidates and an algorithm to score the quality of the matches.
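A rough sketch of the regex side for emails and phone numbers. Both patterns and the `redact` helper are illustrative assumptions, deliberately loose, and will produce false positives (a year range such as 2010-2015 looks like a phone number to this pattern), which is why a scoring step for match quality is still needed on top.

```python
import re

# Illustrative patterns only; real resumes will need tuning per locale.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```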
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Are unescaped user names incompatible with BNF?

I've got (proprietary) output from a piece of software that I need to parse. Sadly, it contains unescaped user names, and I'm scratching my head trying to work out whether or not I can describe the files I need to parse using BNF (or EBNF or ABNF).
The problem, oversimplified (it's really just an example), may look like this:
(data) ::= <username>
<username> ::= (other type of data)
And in some case, instead of appearing at the left or at the right, the username can also appear in the middle of a line.
The problem is that the username is unescaped and there are not enough restrictions on user names (they're printable ASCII, max 20 chars, and they can't contain a line break). So "=" would be a perfectly valid username, for example. And so would "= 1 = john = 2" (because users, at sign-on, were allowed to choose any user name they wanted, and these appear unescaped in the output I've got).
I'm asking because my parser choked on some very creative usernames (once again, not in my control, they're "weird" and I need to deal with it) and I cannot find an easy way to deal with this. Also note that I do not know the user names in advance (for example, I don't have access to a database that would contain all the user names that the users created).
So are unrestricted and unescaped user names incompatible with BNF?
P.S: be cool with me if I made mistakes, it's my first post on stackoverflow :)
BNF doesn't "care" about user names per se. It works at the token level. If you define a username token, you can describe a grammar in BNF based on it.
Your problem should be solved on the lexer level. The lexer should be smart enough to recognize user names, even when they're not escaped, and pass username tokens to the parser.
In theory you could describe all kinds of user names with a grammar, but this heavily depends on the other things in your language. Is = a valid token in its own right? If it is, how can you tell it apart from a username that contains =? I think you'll have to describe the rest of the rules and valid tokens in your language to get a fuller answer here.
It might be possible to work by recognising things that are not usernames and then declaring everything else a username, even if this means parsing from right to left instead of left to right or doing something equally eccentric.
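A minimal sketch of that approach, built on the oversimplified two-line format from the question. Everything concrete here is an assumption: the `" ::= "` separator, the idea that the non-username field is a run of digits (`DATA`), and the function name `parse_line`. The restricted field is recognized first, scanning from the left for one line shape and from the right for the other, and whatever remains is declared the username.

```python
import re

# Hypothetical: the non-username field is assumed to be a run of digits.
DATA = re.compile(r"\d+")
SEP = " ::= "


def parse_line(line):
    """Return (shape, data, username) by recognizing the data field first."""
    # Shape 1: data on the left, so the FIRST separator ends the data field.
    left, sep, right = line.partition(SEP)
    if sep and DATA.fullmatch(left):
        return ("data_first", left, right)
    # Shape 2: data on the right, so scan from the right for the LAST separator.
    left, sep, right = line.rpartition(SEP)
    if sep and DATA.fullmatch(right):
        return ("user_first", right, left)
    raise ValueError("no recognizable data field: " + repr(line))
```

A line such as `42 ::= 7` fits both shapes; this sketch silently prefers the first, which is exactly the kind of ambiguity that needs a decision from whoever owns the requirements.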
It may be worth looking to see if your input is actually ambiguous: can you find two different situations that lead to identical output being generated? If so, you need to go back and get requirements for which of them to favour, or what sort of error to produce, or whatever. If not, the reason why not might help you work out what your parser or lexer or whatever needs to do.
