I am writing code to extract some data between (__italic__, --bold--) characters. (Very similar to the SO comment feature.)
I actually wrote the method for that (using a loop and checking characters), but I wondered if I can re-write that method using Regex.
I tried Rubular, but I am not that good at Regex:
This kinda works for italic, but I think it is not a good solution for using all other special chars (like -- and possibly others)
regex: _{2}([^_]*)_{2}
text: __word1__ not_italic __a__ --bolder--
Is it possible to do that with one match call and one regex, or do I have to create a separate regex for each special formatting character?
Sure you can. Here's a nifty construct you can use: (__|--)((?:(?!\1).)+)\1
Demo + explanation: http://regex101.com/r/tO4tW1
The content you're after will be in the second capture group every time.
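As a rough sketch in Ruby (the extract_marked helper name is an illustrative assumption, not something from the question), the pattern can be applied with String#scan, which returns one array of capture groups per match:

```ruby
# Group 1 captures the delimiter. (?:(?!\1).)+ consumes one character at a
# time as long as the closing delimiter does not start there. \1 closes it.
MARKED = /(__|--)((?:(?!\1).)+)\1/

# Hypothetical helper: returns [delimiter, content] pairs for every match.
def extract_marked(text)
  text.scan(MARKED)
end

extract_marked("__word1__ not_italic __a__ --bolder--").each do |delim, content|
  puts "#{delim} => #{content}"
end
```

The single underscore in not_italic is skipped because the first group requires two underscores in a row.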
I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the two exists.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are incredibly powerful tools, but when you're dealing with things defined by hundred-page convoluted standards, and there is already a library that does it faster, easier, and more correctly, why reinvent the wheel?
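One caveat with the URI approach: URI#path keeps the leading slash (and a trailing one, if the URL has it), so a little trimming is still needed to get exactly the strings listed in the question. A minimal sketch, where path_after_domain is a made-up helper name:

```ruby
require 'uri'

# Hypothetical helper: the path component without leading/trailing slashes.
# The query string (?id=2) is already excluded by URI#path.
def path_after_domain(url)
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
[base, "#{base}/", "#{base}?id=2"].each do |url|
  puts path_after_domain(url)
end
```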
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix up the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
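To see the three alternatives at work, here is a small Ruby sketch that runs the pattern above against the three URLs from the question. The YEAR_PATH constant and year_path helper are names made up for this illustration, and %r{} is used only so the slash needs no escaping:

```ruby
# The pattern from this answer, unchanged apart from the %r{} delimiters.
YEAR_PATH = %r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}

# Hypothetical helper: returns the matched text, or nil when nothing matches.
def year_path(url)
  url[YEAR_PATH]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
[base, "#{base}/", "#{base}?id=2"].each do |url|
  puts year_path(url)
end
```

With the query string, the first alternative wins because its lookahead finds the question mark. For the other two URLs the match falls through to the last alternative, which stops at the final word character and so drops a trailing slash.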
Also, save the pitchforks everyone, regular expressions are not always the best but it's there for you when you need it.
Also, there is xkcd (https://xkcd.com/208/) for everything:
I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
It seems like this works very well on a line by line basis, but after including the multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based off the example link. I have been testing with a Ruby Regex tester, but I think it is an open-ended regex problem/question. It seems to match smaller pieces; I would like to have it either match or not match on the entire string as one big chunk.
In the provided example above, matches are found on a line by line basis it seems. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve using a regex. I am not looking for a solution to the problem using other means; I am asking from a theoretical regex point of view. It has a multi-line mode (which looks to "work"), and it has negative/positive searching which can be combined on a line by line basis, so how come combining these two principles doesn't yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point for the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
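A quick Ruby check shows why the \A matters. This is only a sketch; the constant names and sample strings are invented for illustration. Without the anchor, the engine can retry from a later position where bar is no longer entirely ahead of it:

```ruby
UNANCHORED = /(?=.*foo)(?!.*bar).*/m
ANCHORED   = /\A(?!.*bar).*foo.*/m

has_both = "bar\nfoo"                  # contains bar, so it should be rejected
only_foo = "line one\nfoo here\nlast"  # contains foo and no bar

# The unanchored pattern fails at index 0 but succeeds at index 1, where the
# remaining text "ar\nfoo" no longer contains "bar".
p UNANCHORED.match(has_both)

# The anchored pattern may only start at index 0, so the bar check covers
# the whole string and the match is rejected.
p ANCHORED.match(has_both)

# When bar is absent, the anchored pattern matches the entire string.
p ANCHORED.match(only_foo)[0] == only_foo
```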
A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m
I am trying to grasp the concept of Regular Expressions but seem to be missing something.
I want to ensure that someone enters a string that ends with .wav in a field. Should be a pretty simple Regular Expression.
I've tried this...
[RegularExpression(@"$.wav")]
but it seems to be incorrect. Any help is appreciated. Thanks!
$ is the anchor for the end of the string, so $.wav doesn't make any sense. You can't have any characters after the end of the string. Also, . has a special meaning for regex (it just means 'any character') so you need to escape it.
Try writing
\.wav$
If that doesn't work, try
.*\.wav$
(It depends on if the RegularExpression attribute wants to match the whole string, or just a part of it. .* means 'any character, 0 or more times')
Another thing you should consider is what to do with extra whitespace in the field. Users have a terrible habit of adding extra whitespace in inputs - it's why the various .Trim() functions are so important. Here, RegularExpressionAttribute might be evaluated before you can trim the input, so you might want to write this:
.*\.wav[\s]*$
The [\s]* section means 'any whitespace character (tabs, space, linebreak, etc) 0 or more times'.
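The question is about a .NET attribute, but the pattern itself can be tried out anywhere. Here is a hedged Ruby sketch of the same idea (WAV_NAME and wav_name? are made-up names). One Ruby-specific assumption: $ in Ruby matches at the end of any line, so \z is used to mean the end of the whole string:

```ruby
# A literal ".wav", optional trailing whitespace, then the end of the string.
WAV_NAME = /\.wav\s*\z/

# Hypothetical helper: true when the input ends with .wav, ignoring
# trailing whitespace.
def wav_name?(input)
  WAV_NAME.match?(input)
end

["song.wav", "song.wav  ", "song.wave", "songwav"].each do |name|
  puts "#{name.inspect}: #{wav_name?(name)}"
end

# The original attempt puts the anchor first, so nothing can follow it.
puts(/$.wav/.match?("song.wav"))
```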
You should read a tutorial on regex. It's not so hard to understand for simple problems like this. When I was learning I found this site pretty handy: http://www.regular-expressions.info/
I'm having an issue trying to capture a group on a string:
"type=gist\nYou need to gist this though\nbecause its awesome\nright now\n</code></p>\n\n<script src=\"https://gist.github.com/3931634.js\"> </script>\n\n\n<p><code>Not code</code></p>\n"
My regex currently looks like this:
/<code>([\s\S]*)<\/code>/
My goal is to get everything in between the code brackets. Unfortunately, it's matching up to the 2nd closing code bracket. Is there a way to match everything inside the code brackets up until the first occurrence of the ending code bracket?
All repetition quantifiers in regular expressions are greedy by default (matching as many characters as possible). Make the * ungreedy, like this:
/<code>([\s\S]*?)<\/code>/
But please consider using a DOM parser instead. Regex is just not the right tool to parse HTML.
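The difference is easy to see on a short string with two code elements. A minimal Ruby sketch follows; the sample string and constant names are invented for illustration:

```ruby
HTML = "<code>first</code> text <code>second</code>"

GREEDY = /<code>([\s\S]*)<\/code>/
LAZY   = /<code>([\s\S]*?)<\/code>/

# Greedy: * grabs as much as possible, so group 1 runs to the LAST </code>.
puts HTML[GREEDY, 1]

# Lazy: *? grabs as little as possible, so group 1 stops at the FIRST </code>.
puts HTML[LAZY, 1]
```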
And I just learned that for going through multiple parts, the
str.scan(/<code>(.*?)<\/code>/m) {
  puts $1
}
is a very nice way of going through all occurrences of code (str is the string being searched, and the m flag lets . match newlines) - but yes, getting a proper parser is better...
I need to find strings with * and / using regexes; I am writing in Ruby. The reason I need to find lots of * and / is that I am building a tokenizer for a language, and there are multi-line comments that use the C style of multi-line comments (/* */). I have the single-line comments handled already.
Is there a way to write a regex without having to use the two forward slashes as delimiters? I am finding it impossible to spot my mistakes due to the insane amount of escaping. Or can someone give me advice on how to handle the escaping in a sane manner? I already tried writing the sequence first and then escaping it.
Thank you for your time and advice.
One trick that might help is the %r literal:
%r{http://www\.google\.com}
I like to use pipes myself, when they're not in the regex.
%r|http://www\.google\.com|
You can also create new instances of Regexp via Regexp.new and pass a string.
Finally, you might also look at Regexp.escape (also available as Regexp.quote):
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
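Applied to the C-style comments from the question, the two approaches might look like the sketch below. The constant names and sample source are illustrative assumptions, and the pattern is deliberately naive: it does not account for comment markers inside string literals, which a real tokenizer would have to handle.

```ruby
# With %r{} the forward slashes need no escaping; only the asterisks do.
# The m flag lets . match newlines so the comment can span lines.
BLOCK_COMMENT = %r{/\*.*?\*/}m

# The same pattern assembled from plain strings, letting Regexp.escape
# add whatever backslashes the delimiters need.
BUILT_COMMENT = Regexp.new(
  Regexp.escape("/*") + ".*?" + Regexp.escape("*/"),
  Regexp::MULTILINE
)

SOURCE = "a = 1 /* one */\nb = 2 /* two\n lines */\n"

p SOURCE.scan(BLOCK_COMMENT)
p SOURCE.scan(BUILT_COMMENT)
```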