Regex for phone number validation - ruby

I have to find a phone number in a string, with these conditions:
Starts with 0
Has 10 or 11 digits (0-9)
Has at most 2 "-" characters (not at the start or end)
Example: 01234567890, 01-234567890, 03-1234-12345.
My regex, which does not work:
/\d+{10,11}|(\d+\-\d+){11,12}|(\d+\-\d+\-\d+){12,13}/

It is a bit tricky. First, your regexp has roughly the right idea: since the length changes with the number of dashes, we need to check each case separately. (There might be a better way, but I can't think of one.) However, (\d+-\d+){11,12} does not mean "length of 11-12" but "11-12 repetitions of \d+-\d+", which gives you far more than 11-12 characters. Even if that were fixed, the order of the alternatives would stop you from matching 0123456789-1 in full: the 10 digits would be found first, and ten digits followed by a dash and another digit would never be tried.
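Both points can be checked in a few lines of Ruby. This is only an illustrative sketch: the constants REPEATED and ORDERED and the helper first_match are made up for the demonstration and are cut-down versions of the pieces discussed above, not the asker's full regex.

```ruby
# {n,m} repeats the whole group; it does not limit the length of the match.
REPEATED = /\A(\d+-\d+){2}\z/

# Alternatives are tried left to right at each starting position.
ORDERED = /\d{10,11}|\d+-\d+/

# Hypothetical helper: the first substring of text matched by regex, or nil.
def first_match(regex, text)
  text[regex]
end

puts REPEATED.match?("1-23-4")   # the group fits twice: "1-2" then "3-4"
puts REPEATED.match?("12-34")    # the group fits only once
puts first_match(ORDERED, "0123456789-1").inspect  # the dashless branch wins
```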
If you were trying to validate the whole string, it would have been easier, as you can use anchors ^ and $ to find the end. Without it, it is a little trickier:
(?=[\d-]{12,13}(?![\d-]))0\d+-\d+-\d+(?![\d-])|(?=[\d-]{11,12}(?!-|\d))0\d+-\d+(?![\d-])|\d{10,11}
The first part, (?=[\d-]{12,13}(?![\d-]))0\d+-\d+-\d+(?![\d-]), checks for the two-dash pattern. 10-11 digits plus two dashes is 12-13 characters, so the lookahead (?=[\d-]{12,13}(?![\d-])) checks that there are 12-13 digit-or-dash characters that are followed by neither a digit nor a dash. Having made sure there is such a region, we make sure it contains exactly two dashes, each between digits, and that the whole thing is, again, not followed by a digit or dash. This anchor synchronises the condition in the lookahead with the one in the main pattern.
The second part, (?=[\d-]{11,12}(?!-|\d))0\d+-\d+(?![\d-]), is analogous, checking for one-dash matches (10-11 digits plus one dash is 11-12 characters). The third part, \d{10,11}, is trivially simple, and finds no-dash matches.
All of this is under the assumption that sawa's needling is on-point: that 0123456789- is not a match. If it is, you will need to change some plusses into stars.
Rubular
EDIT: The Rubular pattern still has the wrong \d{11,12} for the dashless case, can't be bothered to generate another Rubular :P
EDIT2: Thought of a better way.
(?=(?:\d-?){10,11}(?![\d-]))\d+(-\d+){0,2}(?![\d-])
This makes sure there are 10-11 digits and 0-2 dashes. The anchor idea is the same as in the previous pattern. As written it does not insist on the leading 0; changing the first \d+ to 0\d* would add that.
Rubular.
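For reference, here is a minimal Ruby sketch of how the EDIT2 pattern might be used with String#scan. Two things in it are assumptions rather than part of the pattern above: a leading lookbehind, so a match cannot start in the middle of a longer run of digits, and 0\d* in place of the first \d+, to enforce the leading 0 from the question. The names PHONE and phone_numbers are hypothetical.

```ruby
PHONE = /
  (?<![\d-])                     # assumption: not preceded by a digit or dash
  (?=(?:\d-?){10,11}(?![\d-]))   # 10-11 digits, each optionally followed by one dash
  0\d*(?:-\d+){0,2}              # assumption: leading 0, then at most two dashed groups
  (?![\d-])                      # not followed by a digit or dash
/x

# Returns every phone-number-like substring found in text.
def phone_numbers(text)
  text.scan(PHONE)
end

puts phone_numbers("call 01234567890 or 03-1234-12345").inspect
puts phone_numbers("0123456789-").inspect   # trailing dash, so nothing is found
```

The groups are non-capturing on purpose: with capture groups, scan would return the captured pieces instead of the whole match.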


Turing Machine Algorithm

Could you please help me? I need to write code for a one-tape Turing machine over the two-letter alphabet {a, b}.
The program should output the common prefix of the two input words.
For example:
g(aab,aaaba) -> aa; g(_,abab) -> _; g(aaba,baa) -> _; g(_,_) -> _; g(babaab,babb) -> bab
Here g is the function computed by the machine and an underscore denotes the empty word; the two words are separated by a space.
I tried to implement the following option:
If at the start we see the letter a, we erase it and move to the beginning of the second word. If we see the letter a there as well, we erase it too and write a after both words, separated from them by a space. After that we return to the beginning of the first word and repeat. When the first letter of the first word and the first letter of the second no longer match, we erase everything that is left.
But I have some trouble with the code: after each operation the space between the two words gets longer, and I don't know how to control this. There is also a problem when the first or the second word is itself the whole common prefix, like this:
g(baa,baabab) -> baa
Your approach seems reasonable. From your description it sounds like you just have trouble generalizing some of the individual steps.
For instance, to deal with the growing spaces between the two words, remember that at any time in the program, the two words are separated by one or more spaces. So implement your seek operation for that general case.
For the common prefix case you have to deal with the situation that you eventually run out of characters to compare. So after deleting the character from the first word, while seeking for the beginning of the second word, check whether the first character you pass over is a letter or a space. If it's a space, you're in the prefix case and need to take care that you don't try to seek back to the first word later, because you already erased all of it and there's only spaces left. Similarly, if the second word is the prefix, you can detect this when seeking to the output.
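The seek step described above can be sketched as a small rule table plus a toy simulator in Ruby. Everything here is hypothetical: the state names, the rule format, and the run helper are invented for illustration, and this is only the "seek to the second word" step, not the whole machine. It assumes the head starts on the cell that was just erased and that the second word still has at least one letter; a step limit guards against running off into blank tape.

```ruby
BLANK = "_"

# Hypothetical rule table: [state, symbol read] => [symbol to write, head move, next state].
RULES = {
  [:erased, BLANK]          => [BLANK, +1, :after_erase],
  # First cell after the erased letter: a letter means word 1 continues,
  # a blank means word 1 is used up (the prefix case).
  [:after_erase, "a"]       => ["a", +1, :skip_word],
  [:after_erase, "b"]       => ["b", +1, :skip_word],
  [:after_erase, BLANK]     => [BLANK, +1, :gap_word1_empty],
  [:skip_word, "a"]         => ["a", +1, :skip_word],
  [:skip_word, "b"]         => ["b", +1, :skip_word],
  [:skip_word, BLANK]       => [BLANK, +1, :gap],
  # The gap may be one or more blanks, so keep moving until a letter shows up.
  [:gap, BLANK]             => [BLANK, +1, :gap],
  [:gap, "a"]               => ["a", 0, :at_word2],
  [:gap, "b"]               => ["b", 0, :at_word2],
  [:gap_word1_empty, BLANK] => [BLANK, +1, :gap_word1_empty],
  [:gap_word1_empty, "a"]   => ["a", 0, :at_word2_last],
  [:gap_word1_empty, "b"]   => ["b", 0, :at_word2_last],
}

# Runs until no rule applies; returns the final state and head position.
def run(tape, head, state, max_steps = 1_000)
  cells = tape.chars
  max_steps.times do
    rule = RULES[[state, cells[head] || BLANK]]
    return [state, head] if rule.nil?
    cells[head], move, state = rule
    head += move
  end
  raise "no halt within #{max_steps} steps"
end

puts run("_ab__aba", 0, :erased).inspect  # word 1 still has letters
puts run("___ab", 0, :erased).inspect     # word 1 is used up
```

Ending in a different state for the prefix case is what lets the rest of the machine know it must not seek back to the first word.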
Try breaking the algorithm down into its essential steps and test each of those steps in isolation. It is much easier to make sure you handle the corner cases correctly when you can focus on a simple step in isolation, instead of having to test it as part of the larger algorithm. Doing this is an essential skill in debugging code, so consider this a good exercise for that. Even if it seems painful at first, make sure you have a structured approach to analyzing problems and breaking your code down into smaller parts, and you will be able to fix any problems eventually. Happy coding!

Need help understanding why this string in grep pulls IP addresses rather than this other string

The following statement is from a homework question that I tested and answered, but I don't understand why this line behaves the way it does. I realize why this expression is a flawed way to find an IP address, but the question mark doesn't seem to behave as "0 or 1 times" like it's supposed to.
"user#machine:~$ grep -E '[01]?[0-9][0-9]?' "
To my understanding, "[01]?" should look for a 0 or a 1, as indicated by the brackets, while the question mark tells grep to look for zero or one instance only, and similarly for "[0-9]?". The thing is, this line will print an unlimited number of digits, far more than 3. I ruled out that it was due to the bracket without a question mark, since it still printed an unlimited number of digits when I piped in an echo or used a test .txt file full of numbers.
This example then made me wonder how to find IPs with grep the correct way. I found countless examples like the following expression for IPv4 octets:
\.(25[0-5]\|2[0-4][0-9]\|[01][0-9][0-9]\|[0-9][0-9]).\
Is this telling me to look for any number 2-5 anywhere from 0-5 times? 0-5 is too many digits for an octet. Is it telling me to look for any number 0-5 up to 25 times? Again that's way too many digits for an octet. What does \2[0-4][0-9]\ mean in this case? I'm confused about how this expression finds numbers strictly between 1-255?
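Look at it this way: x?[0-9]x? matches anything that contains a digit, because both x's are optional. You might as well leave them out, because they do not constrain the match at all. And since grep prints every line that contains a match, not just the matched text, a line with a long run of digits is printed whole.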
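The difference between "the line is printed" and "the text that matched" can be seen with a rough Ruby imitation. This is a sketch, not grep itself: grep_lines is a hypothetical helper that mimics grep without -o, and Ruby's regex engine is not the same as grep -E, although for this simple pattern the two should agree.

```ruby
PATTERN = /[01]?[0-9][0-9]?/

# Mimics grep without -o: keep every line that contains a match anywhere.
def grep_lines(lines)
  lines.select { |line| line.match?(PATTERN) }
end

puts grep_lines(["1234567890", "no digits here", "x7y"]).inspect
puts "1234567890"[PATTERN]                # the matched text itself is at most three characters
puts "1234567890".scan(PATTERN).inspect   # successive matches, roughly what grep -o would list
```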
25[0-5] looks for 25 followed by a digit in the range 0-5. In other words, the expression matches a number in the range 250-255.
The full expression in your example looks for a number in the range 00-255 by enumerating strings beginning with 25, 20-24, etc; though it's incomplete in that it doesn't permit single-digit numbers.
The expression matches a single octet (incompletely), not an entire IP address. Here is a common way to match an IPv4 address:
(0|[3-9][0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|1([0-9][0-9]?)?)(\.(0|[3-9][0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|1([0-9][0-9]?)?)){3}
where the square brackets express character classes which match a single character out of a set, and the final curly braces {3} express a repetition.
Some regex dialects (e.g. the basic regular expressions of plain grep, at least in GNU grep) require backslashes before |, ( and ), but I have used the extended notation (a la grep -E and most online regex exploration tools), which doesn't want backslashes.
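For comparison, here is a minimal Ruby sketch that validates a whole string as an IPv4 address. It deliberately uses a different, frequently seen octet enumeration (250-255, 200-249, 100-199, then 0-99) instead of the expression above, and it anchors the match with \A and \z. OCTET, IPV4 and ipv4? are hypothetical names, and rejecting leading zeros such as 01 is a design choice of this sketch.

```ruby
# One octet, 0-255, without leading zeros.
OCTET = /25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]/

# Four octets separated by dots; interpolating a Regexp wraps it in its own group,
# so the alternation inside OCTET stays contained.
IPV4 = /\A#{OCTET}(?:\.#{OCTET}){3}\z/

def ipv4?(text)
  IPV4.match?(text)
end

puts ipv4?("192.168.0.1")
puts ipv4?("256.1.1.1")
```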

Algorithm for matching cards to a set of rules

I've run into a peculiar problem which I don't seem to be able to wrap my head around. I'll get right into it.
The problem is matching a set of cards to a set of rules.
It is possible to define a set of rules as a string. It is composed of comma-separated tuples of <suit>:<value>. For example H:4, S:1 should match the Four of Hearts and the Ace of Spades. Wildcards are also possible: *:* matches any card, D:* matches any card in the diamond suit, and *:2 matches a Two in any suit. Rules can be combined with commas: *:*,*:*,H:4 would match a set of cards if it held 2 arbitrary cards and a Four of Hearts.
So far so good. A parser for this is easy and straightforward to write. Here comes the tricky part.
To make it easy to compose these rules, two more constructions can be used for suit and value. These are < (legal for suit and value) and +n (legal only for value) where n is a number. < means "the same as previous match" and +n means "n higher than previous match". An example:
*:*, <:*, *:<
Means: match any card, then match a card with the same suit as the first match, next match another card with the same value as the second match. This hand would match:
H:4,H:8,C:8
Because the Four of Hearts and the Eight of Hearts have the same suit, while the Eight of Hearts and the Eight of Clubs have the same value.
It is allowed to have more cards as long as all rules match (so, adding C:10 to the above hand would still match the rule).
My first approach at solving this was to take the set of cards to be matched and attempt to apply the first rule to it. If it matched, I moved on to the next rule and attempted to match it against the set of cards, and so on until either all rules were matched or I found a rule that didn't match. This approach has (at least) one flaw. Consider the example above, *:*,<:*,*:<, but with the cards in this order: H:8,C:8,H:4.
It would match H:8 for the first rule. Matched: H:8
Next it attempts to find a card with the same suit (Hearts). There is a Four of Hearts. Matched: H:8, H:4
Moving on, it wants to find a card with the same value (Four), and fails.
I don't want the way the set of cards is ordered to have any impact on the result as it does in the above example. I could sort the set of cards if I could think of any great strategy that worked well with any set of rules.
I have no knowledge of the quantity of cards or the number of rules, so a brute-force approach is not feasible.
Thank you for reading this far, I am grateful for any tip or insight.
Your problem is actually an ordering problem. Here's a simple version for it:
Given an input sequence of numbers and a pattern, reorder the numbers so that they fit the pattern. The pattern can contain "*", meaning "any number", and ">", meaning "bigger than the previous number".
For example, given pattern [* * > >] and sequence [10 10 2 1] such an ordering exists and it is [10 1 2 10]. Some inputs might give no outputs, others 1, while even others many (think the input [10 10 2 1] and the pattern [* * * *]).
I'd say that once you have the solution for this simplified problem, switching to your problem is just a matter of adding another dimension and some operators. Sorry for not being of more help :/ .
Later edit: keep in mind that if the set of allowed suit symbols is finite (i.e. 4) and so is the set of allowed numbers (i.e. 9), things might get easier.
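To pin down what the simplified problem asks for, here is a brute-force Ruby sketch. It is not a solution to the efficiency concern in the question: it tries permutations and is only usable for small inputs. The helpers fits? and find_ordering are hypothetical, the pattern is assumed to have one symbol per number, and ">" in the first position is treated as never satisfiable.

```ruby
# True if the ordering satisfies the pattern position by position.
def fits?(ordering, pattern)
  pattern.each_with_index.all? do |symbol, i|
    symbol == "*" || (i > 0 && ordering[i] > ordering[i - 1])
  end
end

# Returns some ordering of numbers that fits the pattern, or nil if none exists.
def find_ordering(numbers, pattern)
  numbers.permutation.find { |candidate| fits?(candidate, pattern) }
end

puts find_ordering([10, 10, 2, 1], %w[* * > >]).inspect
puts find_ordering([5, 5], %w[* >]).inspect
```

A backtracking search that places one element at a time and abandons a partial ordering as soon as a ">" fails would visit far fewer candidates, but the accepted orderings are the same.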

Password validation

I need to validate a password with the following requirements:
1. Be at least seven characters long
2. Contain at least one letter (a-z or A-Z)
3. Contain at least one number (0-9)
4. Contain at least one symbol (#, $, %, etc)
Can anyone give me the correct expression?
/.{7,}/
/[a-zA-Z]/
/[0-9]/
/[-!##$%^...]/
For a single regex, the most straightforward way to check all of the requirements would be with lookaheads:
/(?=.*[a-zA-Z])(?=.*\d)(?=.*[^a-zA-Z0-9\s]).{7,}/
Breaking it down:
.{7,} - at least seven characters
(?=.*[a-zA-Z]) - a letter must occur somewhere after the start of the string
(?=.*\d) - same idea, but for a digit
(?=.*[^a-zA-Z0-9\s]) - same idea, but for something that is not a letter, digit, or whitespace
However, you might choose to simply utilize multiple separate regex matches to keep things even more readable - chances are you aren't validating a ton of passwords at once, so performance isn't really a huge requirement.
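The multiple-regex route might look like the following Ruby sketch. The names PASSWORD_CHECKS and password_problems are hypothetical, and "symbol" is taken to mean anything that is not a letter, digit, or whitespace, as in the single regex above. A side benefit is that each failed requirement can be reported separately.

```ruby
# One small regex per requirement, keyed by a human-readable description.
PASSWORD_CHECKS = {
  "at least seven characters" => /.{7,}/,
  "at least one letter"       => /[a-zA-Z]/,
  "at least one digit"        => /[0-9]/,
  "at least one symbol"       => /[^a-zA-Z0-9\s]/,
}

# Returns the descriptions of the requirements the password fails (empty if valid).
def password_problems(password)
  PASSWORD_CHECKS.reject { |_description, regex| regex.match?(password) }.keys
end

puts password_problems("abc123!").inspect
puts password_problems("abcdefg").inspect
```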

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word, that are actual english words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
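As an alternative to the gsub version, the same signature can be built by counting letters first. This is a sketch under two assumptions: Ruby 2.7 or newer (for Enumerable#tally), and the same threshold as the regex above, where only runs of three or more are encoded. The method name run_length_signature is hypothetical.

```ruby
# Counts each letter, sorts by letter, and encodes runs of three or more as count + letter.
def run_length_signature(word)
  word.chars.tally.sort.map do |letter, count|
    count > 2 ? "#{count}#{letter}" : letter * count
  end.join
end

puts run_length_signature("mississippi")
```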
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.
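For the use described in the question, finding words that share a signature, the plain sorted-letters form is already enough. A minimal Ruby sketch, with hypothetical helper names signature and anagram_groups, and a lowercase step added as an assumption:

```ruby
# Sorted letters of the word; anagrams share the same signature.
def signature(word)
  word.downcase.chars.sort.join
end

# Groups words by signature and keeps only groups with at least two members.
def anagram_groups(words)
  words.group_by { |word| signature(word) }.values.select { |group| group.size > 1 }
end

puts signature("mississippi")
puts anagram_groups(%w[listen silent enlist google]).inspect
```

In a database the signature would be stored in its own indexed column, so the lookup becomes a simple equality query.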
