regular expression gsub only if it does not have anything before - ruby

Is there any way to scan only if there is nothing before what I am scanning for?
For example I have a post and I am scanning for a forward slash and what follows it but I do not want to scan for a forward slash if it is not the beginning character.
I want to scan for /this but I do not want to scan for this/this or http://this.com.
The regular expression I am currently using is..
/\/(\w+)/
I am using this with gsub to link each /forwardslash.

I think what you are asking for is to only match words that begin with '/', not strings or lines beginning with '/'. If that is true, I believe the following regex will work: %r{(?:^|\s+)/(\w+)}.
For example:
"/foo /this this/that http://this".scan %r{(?:^|\s+)/(\w+)} # => [["foo"], ["this"]]

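The caret (^) anchors a match to the beginning of a line, and a dollar sign ($) anchors it to the end of a line. (In Ruby, \A and \z anchor to the beginning and end of the whole string.)
So
/^\/(\w+)/
...will get you what you want for a single-line post: it only matches when the slash is the first character of a line.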
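The difference between the two anchors only shows up when the post spans several lines. A small sketch, with constant names and sample text made up for illustration:

```ruby
# Constant names and the sample post are assumptions for illustration.
LINE_START   = /^\/(\w+)/   # ^ matches at the start of every line
STRING_START = /\A\/(\w+)/  # \A matches only at the start of the string

post = "see below\n/this"

p post.scan(LINE_START)    # prints [["this"]]
p post.scan(STRING_START)  # prints []
```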

First thing, since you're using a regex with slashes, change the delimiter to something else (in Ruby, a %r literal such as %r!...!); then you won't have to escape the forward slashes and it will be easier to read.
Secondly, if you want to replace the slash as well then include it in the capture.
On to the regex.
"...if it is not the beginning character..."
...of a line:
!^(/\w+)!
"...if it is not the beginning character..."
...of a word:
!\s(/\w+)!
but that won't match if it's at the very beginning of a line. For that you'll need something a lot more complex, so I'd just run both the regexes here instead of creating that monster.
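As an alternative to running two regexes, a negative lookbehind may cover both cases in one pattern. This is a sketch rather than the answerer's method: the helper name `link_slash_words` and the anchor markup are hypothetical.

```ruby
# (?<!\S) succeeds at the start of the string and right after whitespace,
# without consuming the whitespace itself.
WORD_INITIAL_SLASH = %r{(?<!\S)/(\w+)}

# Hypothetical helper: wraps each word-initial /name in an HTML link.
# The link markup is an assumption made for illustration.
def link_slash_words(text)
  text.gsub(WORD_INITIAL_SLASH, '<a href="/\1">/\1</a>')
end

p "/foo /this this/that http://this.com".scan(WORD_INITIAL_SLASH)
# prints [["foo"], ["this"]]

puts link_slash_words("see /this but not this/that")
# prints see <a href="/this">/this</a> but not this/that
```

Because the lookbehind consumes nothing, the whitespace before the slash does not have to be put back in the replacement string.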

Related

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind asserting that the character we are about to match is not preceded by a non-whitespace character. So this matches a # at the start of the string or a # that directly follows a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :#, but hwnd's does. \B matches between two word characters or between two non-word characters.
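The two patterns can be compared side by side. In this sketch the method names and the ":#tag" input are made up for illustration:

```ruby
# Method names are hypothetical; the patterns are the ones discussed above.
def strip_with_lookbehind(str)
  str.gsub(/(?<!\S)#/, '')
end

def strip_with_non_boundary(str)
  str.gsub(/\B#/, '')
end

emails = "#jackie#test.com, #mike#test.com"
p strip_with_lookbehind(emails)    # prints "jackie#test.com, mike#test.com"
p strip_with_non_boundary(emails)  # prints "jackie#test.com, mike#test.com"

# The patterns disagree when # follows another non-word character.
p strip_with_lookbehind(":#tag")    # prints ":#tag"
p strip_with_non_boundary(":#tag")  # prints ":tag"
```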
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Matching only a single standalone letter

I'm trying to write a regular expression that matches only a single standalone letter, such as a, C, f, G, but NOT abc or de, for instance.
I tried [a-zA-z], but all of the above match.
What should I do in this case?
^[a-zA-Z]$
Add the ^ and $ anchors to limit the match to a string that is just one character.
or
(?:^|(?<=[^a-zA-Z]))[a-zA-Z](?=[^a-zA-Z]|$)
There are several ways to do this, depending on your content. This could work:
[^a-zA-Z][a-zA-Z][^a-zA-Z]
Or use the word-boundary anchor, \b:
\b[a-zA-Z]\b
which is more useful since it allows matches at the start and end of a line.
Your regex [a-zA-z] matches not only letters but also matches [, ], \, ^, _ and `. Moreover, it has no anchors and thus will match both a and t in at.
You can make use of the POSIX bracket expression alpha to match a single letter substring together with a word boundary \b:
puts 'a,C,f,G, but, NOT abc de'.scan(/\b[[:alpha:]]\b/)
Output:
a
C
f
G
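One caveat with \b is that digits and underscores count as word characters, so a letter next to them is not "standalone" for \b. The sketch below (sample strings and constant names are made up) shows the A-z range typo and contrasts \b with lookarounds that only look at neighbouring letters:

```ruby
# The A-z typo also covers the ASCII characters between 'Z' and 'a'.
p "[".match?(/[a-zA-z]/)  # prints true
p "[".match?(/[a-zA-Z]/)  # prints false

# Constant names are hypothetical.
BOUNDARY_LETTER   = /\b[a-zA-Z]\b/
LOOKAROUND_LETTER = /(?<![a-zA-Z])[a-zA-Z](?![a-zA-Z])/

sample = "a_b 1c x"
p sample.scan(BOUNDARY_LETTER)    # prints ["x"]
p sample.scan(LOOKAROUND_LETTER)  # prints ["a", "b", "c", "x"]
```

Which behaviour is right depends on whether a_b should count as two standalone letters in your data.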

CamelCase regexp not accounting for spaces

I created a regexp to match the following scenarios: SomethingCool, HelloWorld, MyNameIsDonato, etc. However, it does not account for spaces:
> 'Something Cooler' =~ /([A-Z][a-z0-9]+)+/
=> 0
That passes and it should not pass. A space is not an alphanumeric character. So why does this pass and how can I fix it?
You need to anchor the regex to the beginning and end of the string, or it will just match one of the words:
^([A-Z][a-z0-9]+)+$
^ and $ anchor the beginnings and ends of lines, respectively. To anchor to the beginning and end of the string, use \A and \z (\Z also matches just before a trailing newline).
It's worth noting that this is useless if you're trying to find camelcase names within a larger string. For that, use your original regex.
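A quick check of the anchored variants, with constant names chosen for illustration. The multi-line input shows where line anchors and string anchors part ways:

```ruby
# Constant names are hypothetical; the pattern body is the one from the question.
LINE_ANCHORED   = /^([A-Z][a-z0-9]+)+$/
STRING_ANCHORED = /\A([A-Z][a-z0-9]+)+\z/

p "Something Cooler".match?(LINE_ANCHORED)    # prints false
p "HelloWorld".match?(STRING_ANCHORED)        # prints true

# With several lines, ^...$ is satisfied by any single matching line.
multi = "HelloWorld\nnot camel case"
p multi.match?(LINE_ANCHORED)    # prints true
p multi.match?(STRING_ANCHORED)  # prints false
```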

Regular expression help to skip first occurrence of a special character while allowing for later special chars but no whitespace

I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of ^\s (i.e. \S) so you don't need both. Also, I assume you don't want any more #s in the match, so we use [^#\s] which negates both whitespace and # characters.
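Trying the negated character class on the inputs from the question (the constant name and the second tag are made up for illustration):

```ruby
# One '#', then one or more characters that are neither '#' nor whitespace.
HASHTAG = /#[^#\s]+/

p "##yolo?"[HASHTAG]             # prints "#yolo?"
p "##yolo? #swag".scan(HASHTAG)  # prints ["#yolo?", "#swag"]
```

The first # of "##yolo?" is skipped because the character after it is another #, which the class excludes.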

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
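Both forms can be tried with String#scan on the sample data. The constant name below is made up; note that scan returns one array per match when the pattern has a capture group, hence the flatten:

```ruby
# Sample data from the question, without the leading "| " markers.
SAMPLE = <<~TEXT
  Title: This is a sample title
  Content: This is sample content
  Date: 12/21/2012
TEXT

# Capture group: each match comes back as a one-element array.
p SAMPLE.scan(/: (.+)/).flatten
# prints ["This is a sample title", "This is sample content", "12/21/2012"]

# Lookbehind: no group, so scan returns the matched strings directly.
p SAMPLE.scan(/(?<=: ).+/)
# prints ["This is a sample title", "This is sample content", "12/21/2012"]
```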
In addition to @minitech's answer, you can also make a third variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outer parentheses in your regex create a capture group. Compare that to the last example I gave, where the capture group sits inside the look-ahead: wrapping the whole pattern in /( ... )/ is needless, since the first result in most regular expression engines is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
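One thing to watch with the split approach: if the value itself contains ": ", taking the last element drops part of it. Passing a limit of 2 to split keeps everything after the first separator. The helper name and sample line below are made up for illustration:

```ruby
# Hypothetical helper: returns everything after the first ": ",
# or nil when the line has no ": " at all.
def value_after_colon(line)
  line.split(": ", 2)[1]
end

line = "Title: Ruby: a primer"
p line.split(": ")[-1]              # prints "a primer"
p value_after_colon(line)           # prints "Ruby: a primer"
p value_after_colon("Example Data") # prints nil
```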
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
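The choice between [[:blank:]] and \s matters when a field is empty, because \s can step over the line break and pick up the next line. A small sketch with made-up constant names and sample input:

```ruby
# Constant names are hypothetical; the patterns are the two shown above.
AFTER_COLON_BLANK = /:[[:blank:]]*\K.+/
AFTER_COLON_SPACE = /:\s*\K.+/

p "Date: 12/21/2012"[AFTER_COLON_BLANK]  # prints "12/21/2012"

# An empty "Title:" field followed by a filled "Date:" field.
text = "Title:\nDate: 12/21/2012"
p text.scan(AFTER_COLON_BLANK)  # prints ["12/21/2012"]
p text.scan(AFTER_COLON_SPACE)  # prints ["Date: 12/21/2012"]
```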
