Ruby regex syntax for "not matching one of the following" - ruby

Nice simple regex syntax question for you.
I have a block of text and I want to find instances of href=" or href=' which are NOT followed by either [ or http://
I can get "not followed by [" with
record.body =~ /href=['"](?!\[)/
and I can get "not followed by http://" with
record.body =~ /href=['"](?!http\:\/\/)/
But I can't quite work out how to combine the two.
Just to be clear: I want to find bad strings like this
`href="www.foo.com"`
but I'm OK with (i.e. don't want to find) strings like this
`href="http://www.foo.com"`
`href="[registration_url]"`

Combine both by using the alternation operator.
href=['"](?!http\:\/\/|\[)
To be more specific, it would be:
href=(['"])(?!http\:\/\/|\[)(?:(?!\1).)*\1
This handles both single-quoted and double-quoted strings in the href part, and it won't match strings with mismatched quotes such as href='foo.com" or href="foo.com'.
(['"]) captures a double or single quote. (?!http\:\/\/|\[) asserts that the captured quote is not followed by http:// or [; if it is, the match fails at this position and the engine moves on to the next candidate. (?:(?!\1).)* matches any character other than the captured quote, zero or more times. \1 matches the closing quote, which must be the same character as the one captured.
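A minimal sketch of the combined pattern in use. The sample markup and the helper name bad_hrefs are made up for illustration; only the regex comes from the answer above (written with %r{} so the slashes need no escaping).

```ruby
# Quote captured in group 1, then a negative lookahead rejecting http:// and [,
# then everything up to the same closing quote.
BAD_HREF = %r{href=(['"])(?!http://|\[)(?:(?!\1).)*\1}

# Hypothetical helper: returns every full href attribute that looks "bad".
def bad_hrefs(body)
  found = []
  # With capture groups, scan yields the groups, so read the whole match
  # from Regexp.last_match instead.
  body.scan(BAD_HREF) { found << Regexp.last_match(0) }
  found
end

# Hypothetical sample text.
body = <<~HTML
  <a href="www.foo.com">bad</a>
  <a href='http://www.foo.com'>ok</a>
  <a href="[registration_url]">ok</a>
  <a href='foo.com/path'>bad</a>
HTML

p bad_hrefs(body)
# => ["href=\"www.foo.com\"", "href='foo.com/path'"]
```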

Use an alternation with the pipe symbol | to combine the look-ahead conditions:
(?!http\:\/\/|\[)
So, to match the (double-quoted) hrefs, you can use the following regex:
href=\"((?!http\:\/\/|\[)[^\"]+?)\"

Related

Regex match all strings between 2 characters

I tried a couple of other links like Regex Match all characters between two strings and Regex get all content between two characters
but they don't seem to fit this use case.
I want to get all the names, potato and tomato. That is, everything from | to >.
text = "with <#U0D08NR3|potato> and <#U1698M96|tomato> please"
text.scan((?<=|).*?(?=>)) doesn't seem to work either.
Please guide me regex gods.
You forgot to escape the pipe (|), which is now interpreted as the alternation operator.
Your regex with the escaped pipe:
(?<=\|).*?(?=>)
Just escape the |: (?<=\|).*?(?=>). Without the escape, (?<=|) is an alternation of two empty patterns, so the lookbehind succeeds at every position and the match can start anywhere.
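For completeness, a short sketch of the escaped version wrapped in regex delimiters, since scan needs a Regexp (or String) argument. The helper name names_between is made up for illustration.

```ruby
# Hypothetical helper: returns the text between each | and the next >.
def names_between(text)
  # (?<=\|) requires a literal pipe just before the match,
  # .*? is lazy so it stops at the first position followed by >.
  text.scan(/(?<=\|).*?(?=>)/)
end

text = 'with <#U0D08NR3|potato> and <#U1698M96|tomato> please'
p names_between(text)
# => ["potato", "tomato"]
```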
Try this:
text.scan(/\|(\w*)\>/).flatten
# Returns => ["potato", "tomato"]
This works because \w* only matches word characters, so it stops at the > on its own, and because the pattern contains a capture group, scan returns only the captured text.

Matching only a single standalone letter

I'm trying to write a regular expression that matches only a single standalone letter, such as a, C, f, G, but NOT abc or de, for instance.
I tried [a-zA-z], but all of the above match.
What should I do in this case?
^[a-zA-Z]$
Add the ^ and $ anchors to limit the match to input that consists of exactly one letter.
or
(?:^|(?<=[^a-zA-Z]))[a-zA-Z](?=[^a-zA-Z]|$)
There are several ways to do this, depending on your content. This could work:
[^a-zA-Z][a-zA-Z][^a-zA-Z]
Or there's a regex token for that, the word boundary \b:
\b[a-zA-Z]\b
which is more useful since it allows matches at the start and end of a line.
Your regex [a-zA-z] matches not only letters but also matches [, ], \, ^, _ and `. Moreover, it has no anchors and thus will match both a and t in at.
You can make use of the POSIX bracket expression alpha to match a single letter substring together with a word boundary \b:
puts 'a,C,f,G, but, NOT abc de'.scan(/\b[[:alpha:]]\b/)
Output:
a
C
f
G
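The A-z pitfall mentioned above is easy to check. This is only a sketch: the constant names LOOSE and STRICT and the helper standalone_letters are made up for illustration.

```ruby
LOOSE  = /[A-z]/     # the range A-z also covers [ \ ] ^ _ and `
STRICT = /[A-Za-z]/  # letters only

# Hypothetical helper: single letters with a word boundary on each side.
def standalone_letters(text)
  text.scan(/\b[[:alpha:]]\b/)
end

p "_".match?(LOOSE)   # => true
p "_".match?(STRICT)  # => false
p standalone_letters('a,C,f,G, but, NOT abc de')
# => ["a", "C", "f", "G"]
```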

Lookahead containing the same token as left/right anchors

Got a variation of the classic "regex quoted strings" problem. I need to pick out strings that look like this:
"foo bar bar"
from a long string like this
token token "maybe quoted token that can also contain spaces"
Each of the tokens can be quoted or unquoted (this is easy to take care of using alternating groups) but sometimes I have quoted strings which have literal quotes inside them (not escaped in any way),
the only useable thing being that those quotes never have spaces on either side (since that would
create a delimiter). Those tokens look like this: "foo-bar"baz"
My initial thought was /"(?:[^"]|" )*"/ but that doesn't seem to work because a token like this: "here is some"quotes" gets split in two.
How should I do this? Platform is Ruby 2.1
Use this:
"(?:[^"]|"\w)+"
or
"(?:[^"]|"\S)+"
Explanation
" matches the opening quote
The non-capturing group (?:[^"]|"\w) matches...
One [^"] non-quote character, OR |
One quote and a word character "\w
+ one or more times
" closing quote
Further Refinements
If you want to allow quotes in other contexts, for instance escaped quotes, just add them to the alternation:
"(?:\\"|[^"]|"\w)+"
To allow quotes to be followed not just by a word char but any non-space:
"(?:\\"|[^"]|"\S)+"
This one may also suit your needs:
".*?"(?!\S)
To match also non-quoted tokens:
".*?"(?!\S)|\S+
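A quick sketch of both approaches against the kind of input described in the question. The sample line and the helper names are made up for illustration.

```ruby
# Quoted tokens only: an inner quote is allowed when a non-space follows it.
QUOTED = /"(?:[^"]|"\S)+"/
# Quoted or unquoted tokens: a closing quote must not be followed by a non-space.
TOKEN  = /".*?"(?!\S)|\S+/

# Hypothetical helpers.
def quoted_tokens(line)
  line.scan(QUOTED)
end

def all_tokens(line)
  line.scan(TOKEN)
end

line = 'token token "here is some"quotes" "foo-bar"baz"'
p quoted_tokens(line)
# => ["\"here is some\"quotes\"", "\"foo-bar\"baz\""]
p all_tokens(line)
# => ["token", "token", "\"here is some\"quotes\"", "\"foo-bar\"baz\""]
```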

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
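Both variants side by side, as a rough sketch. The helper names are made up; the sample lines are the ones from the question.

```ruby
# Variant 1: match the colon and space, then read capture group 1.
def value_by_group(line)
  m = line.match(/: (.+)/)
  m && m[1]
end

# Variant 2: the lookbehind keeps ": " out of the match itself.
def value_by_lookbehind(line)
  line[/(?<=: ).+/]
end

p value_by_group('| Title: This is a sample title')  # => "This is a sample title"
p value_by_lookbehind('| Date: 12/21/2012')          # => "12/21/2012"
p value_by_lookbehind('| Example Data')              # => nil
```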
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead concept over the look-behind:
/(?=: ?(.+))/
This puts a capture group inside the look-ahead, so the data ends up in group 1.
And yes, the outer parentheses in your code create a capture group. Compare that to the last example, where the group sits inside the look-ahead: wrapping the whole pattern in /( ... )/ is redundant, since in most regular expression engines the first result is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
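Two things to watch with the split approach: a value that itself contains ": " gets cut, and a line without a colon comes back whole. A possible variation, offered only as a sketch, is to pass a limit of 2 to split; the helper name value_after_colon is made up.

```ruby
# Hypothetical helper: split at the first ": " only and take the right-hand side.
# Returns nil when the line has no ": " at all.
def value_after_colon(line)
  line.split(': ', 2)[1]
end

p '| Title: Ruby: a tour'.split(': ')[-1]   # => "a tour"
p value_after_colon('| Title: Ruby: a tour') # => "Ruby: a tour"
p '| Example Data'.split(': ')[-1]          # => "| Example Data"
p value_after_colon('| Example Data')        # => nil
```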
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
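A small sketch of \K in action on the sample lines from the question. The helper name text_after_colon is made up for illustration.

```ruby
# Hypothetical helper: everything after the first colon and any blanks.
def text_after_colon(line)
  # \K drops the colon and blanks from the reported match.
  line[/:[[:blank:]]*\K.+/]
end

p text_after_colon('| Content: This is sample content')  # => "This is sample content"
p text_after_colon('| Example Data')                     # => nil
p 'adhd'[/h\Kd/]                                         # => "d"
```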

regular expression gsub only if it does not have anything before

Is there any way to scan only if there is nothing before what I am scanning for?
For example, I have a post and I am scanning for a forward slash and what follows it, but I do not want to match a forward slash if it is not the beginning character.
I want to scan for /this but I do not want to scan for this/this or http://this.com.
The regular expression I am currently using is..
/\/(\w+)/
I am using this with gsub to link each /forwardslash.
I think what you are asking for is to only match words that begin with '/', not strings or lines beginning with '/'. If that is true, I believe the following regex will work: %r{(?:^|\s+)/(\w+)}
For example:
"/foo /this this/that http://this".scan %r{(?:^|\s+)/(\w+)} # => [["foo"], ["this"]]
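The caret (^) anchors at the beginning of a line and the dollar sign ($) at the end of a line. In Ruby, \A and \z anchor at the beginning and end of the whole string.
So
/^\/(\w+)/
...will only match a slash at the beginning of a line. Use \A instead of ^ if it must be the very beginning of the string.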
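One Ruby-specific detail worth checking here is that ^ matches at the start of every line, while \A matches only at the start of the string. A minimal sketch, with made-up helper names and sample text:

```ruby
# Hypothetical helpers comparing the two anchors.
def line_start_matches(post)
  post.scan(/^\/(\w+)/).flatten
end

def string_start_matches(post)
  post.scan(/\A\/(\w+)/).flatten
end

post = "intro text\n/this is on line two"
p line_start_matches(post)            # => ["this"]
p string_start_matches(post)          # => []
p string_start_matches('/this first') # => ["this"]
```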
First thing, since you're using a regex with slashes, change the delimiter to something else; then you won't have to escape the forward slashes and it will be easier to read.
Secondly, if you want to replace the slash as well then include it in the capture.
On to the regex.
...if it is not the beginning character...
...of a line:
%r!^(/\w+)!
...if it is not the beginning character...
...of a word:
%r!\s(/\w+)!
but that won't match if it's at the very beginning of a line. For that you'll need something a lot more complex, so I'd just run both the regexes here instead of creating that monster.
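As an alternative to running two regexes, a negative lookbehind can cover both cases in one pattern. This is a different technique from the one in the answer above, and it is only a sketch: the helper name link_slash_words and the link markup are made up for illustration.

```ruby
# Hypothetical helper: links every /word that is not preceded by a non-space
# character, i.e. at the start of the text, of a line, or after whitespace.
def link_slash_words(post)
  post.gsub(%r{(?<!\S)/(\w+)}) do
    word = Regexp.last_match(1)
    "<a href=\"/#{word}\">/#{word}</a>"
  end
end

puts link_slash_words('/foo and /this but this/that and http://this.com')
# prints: <a href="/foo">/foo</a> and <a href="/this">/this</a> but this/that and http://this.com
```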
