Verify string in MVC validator using regularexpressions - model-view-controller

I am trying to grasp the concept of Regular Expressions but seem to be missing something.
I want to ensure that someone enters a string that ends with .wav in a field. Should be a pretty simple Regular Expression.
I've tried this...
[RegularExpression(#"$.wav")]
but seem to be incorrect. Any help is appreciated. Thanks!

$ is the anchor for the end of the string, so $.wav doesn't make any sense. You can't have any characters after the end of the string. Also, . has a special meaning for regex (it just means 'any character') so you need to escape it.
Try writing
\.wav$
If that doesn't work, try
.*\.wav$
(It depends on if the RegularExpression attribute wants to match the whole string, or just a part of it. .* means 'any character, 0 or more times')
Another thing you should consider is what to do with extra whitespace in the field. Users have a terrible habit of adding extra white space in inputs - its why various .Trim() functions are so important. Here, RegularExpressionAttribute might be evaluated before you can trim the input, so you might want to write this:
.*\.wav[\s]*$
The [\s]* section means 'any whitespace character (tabs, space, linebreak, etc) 0 or more times'.
You should read a tutorial on regex. It's not so hard to understand for simple problems like this. When I was learning I found this site pretty handy: http://www.regular-expressions.info/

Related

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in an URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the 2 exist.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, Regexes are an incredibly powerful tools, but when you're dealing with things that are made from hundred page convoluted standards when there is already a library for doing it faster, easier, and more correctly, why reinvent this wheel?
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix out the prefix, I just assumed that you wanted 2XXX where X is a digit to match.
Also, save the pitchforks everyone, regular expressions are not always the best but it's there for you when you need it.
Also, there is xkcd (https://xkcd.com/208/) for everything:

Using Regex to Extract between __italic__ and some other

I am writing code to extract some data between (italic, --bold--) characters. (Very similar to SO comment feature)
I actually wrote the method for that (using a loop and checking characters), but I wondered if I can re-write that method using Regex.
I tried Rubular, but I am not that good at Regex:
This kinda works for italic, but I think it is not a good solution for using all other special chars (like -- and possibly others)
regex: _{2}([^_]*)_{2}
text: __word1__ not_italic __a__ --bolder--
Is it possible to do that with a 1 match call and regex, or do I have to crete special regex's for each special formatting characters?
Sure you can. Here's a nifty construct you can use: (__|--)((?:(?!\1).)+)\1
Demo + explanation: http://regex101.com/r/tO4tW1
The content you're after will be in the second backreference every time.

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not much complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the par inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
Is this what you're after?
By the way, this answer doesn't contradict #nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but don't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see #AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes than backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as #nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right (do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important to decide which strategy works best for your case).

Using regexes in ruby with a need to match lots of * and /

I need to find strings with * and / using reg-exes, I am writing in Ruby.The reason for this need to find lots of * and / is that I am building a tokenizer for an language and there are multi-line comments that use the C style of multi-line comments (/* */). I have the single line comments handled already.
Is there a way to use reg-ex without having to use the two foreword slashes to indicate some regular expression because I am finding it impossible to find my mistakes due to the insane amount of escaping. Or can someone give me advise on how to handle the escaping in a sane matter? I already tried writing the sequence first then escaping it.
Thank you for your time and advise.
One trick that might help is the %r literal:
%r{http://www\.google\.com}
I like to use pipes myself, when they're not in the regex.
%r|http://www\.google\.com|
You can also create new instances of Regexp via Regexp.new and pass a string.
Finally, you might also look at Regexp.quote:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.

regex to match trailing whitespace, but not lines which are entirely whitespace (indent placeholders)

I've been trying to construct a ruby regex which matches trailing spaces - but not indentation placeholders - so I can gsub them out.
I had this /\b[\t ]+$/ and it was working a treat until I realised it only works when the line ends are [a-zA-Z]. :-( So I evolved it into this /(?!^[\t ]+)[\t ]+$/ and it seems like it's getting better, but it still doesn't work properly. I've spent hours trying to get this to work to no avail. Please help.
Here's some text test so it's easy to throw into Rubular, but the indent lines are getting stripped so it'll need a few spaces and/or tabs. Once lines 3 & 4 have spaces back in, it shouldn't match on lines 3-5, 7, 9.
some test test
some test test
some other test (text)
some other test (text)
likely here{ dfdf }
likely here{ dfdf }
and this ;
and this ;
Alternatively, is there an simpler / more elegant way to do this?
If you're using 1.9, you can use look-behind:
/(?<=\S)[\t ]+$/
but unfortunately, it's not supported in older versions of ruby, so you'll have to handle the captured character:
str.gsub(/(\S)[\t ]+$/) { $1 }
Your first expression is close, and you just need to change the \b to a negated character class. This should work better:
/([^\t ])[\t ]+$
In plain words, this matches all tabs and spaces on lines that follow a character that is not a tab or a space.
Wouldn't this help?
/([^\t ])([\t ]+)$/
You need to do something with the matched last non-space character, though.
edit: oh, you meant non blank lines. Then you would need something like /([^\s])\s+/ and sub them with the first part
I'm not entirely sure what you are asking for, but wouldn't something like this work if you just want to capture the trailing whitespaces?
([\s]+)$
or if you only wanted to capture tabs
([ \t]+)$
Since regexes are greedy, they'll capture as much as they can. You don't really need to give them context beforehand if you know what you want to capture.
I still am not sure what you mean by trailing indentation placeholders, so I'm sorry if I'm misunderstanding.
perhaps this...
[\t|\s]+?$
or
[ ]+$

Resources