Ruby Regex: Match Until First Occurrence of Character - ruby

I have a file with lines that vary in their format, but the basic idea is like this:
- A block of text #tag #due(2014-04-20) #done(2014-04-22)
For example:
- Email John Doe #email #due(2014-04-20) #done(2014-04-22)
The issue is the #tag and the #due date do not appear in every entry, so some are just like:
- Email John Doe #done(2014-04-22)
I'm trying to write a Ruby Regex that finds the item between the "- " and the first occurrence of EITHER a hashtag or a #done/#due tag.
I have been trying to use groups and look ahead, but I can't seem to get it right when there are multiple instances of what I am looking ahead for. Using my second example string, this Regex:
/-\s(.*)(?=[#|#])/
Yields this result for (.*):
Email John Doe #email #due(2014-04-22)
Is there any way I can get this right? Thanks!

You're missing the ? quantifier that makes the match non-greedy. I would also remove | from inside your character class: a character class matches any single character in its list, so in (#|#) the | is treated as a literal character, not as alternation.
/-\s(.*?)(?=[##])/
You really don't need a positive lookahead here either; just match up until those characters and print the result from your capturing group.
/-\s(.*?)[##]/
You could also use negation in this case.
/-\s([^##]*)/
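As a minimal sketch of the negated-class approach, assuming every tag starts with # and that surrounding whitespace should be trimmed (the helper name task_text is hypothetical, not from the question):

```ruby
# Hypothetical helper: text between the leading "- " and the first "#".
def task_text(line)
  # [^#]* cannot run past a "#", so no lazy quantifier is needed.
  text = line[/\A-\s([^#]*)/, 1]
  text && text.strip
end

lines = [
  "- Email John Doe #email #due(2014-04-20) #done(2014-04-22)",
  "- Email John Doe #done(2014-04-22)"
]
lines.each { |line| puts task_text(line) }
# Both lines print: Email John Doe
```

The helper returns nil when the line does not start with "- ", which makes it easy to skip lines that are not entries.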

This should do it:
str = "- Email John Doe #email #due(2014-04-20) #done(2014-04-22)"
str[/-(.*?)#|#due|#done/,1]
#=> " Email John Doe "
(.*?) is a capture group, with ? making .* non-greedy. The result of the capture is retrieved by the ,1 at the end.
Credit to @hwnd for noticing the need to make .* non-greedy shortly before I posted, though I did not see the comment until later.

Related

select quotes but NOT apostrophes in REGEX

I would like to achieve this amazing result (I'm using Ruby):
input: "Joe can't tell between 'large' and large."
output: "Joe can't tell between large and large."
getting rid of the quotes but not of the apostrophe
how can I do it in a simple way?
my failed overcomplicated attempt:
entry = test[0].gsub(/[[']*1]/, "")
Simplest one for your situation could be something like this.
Regex: /\s'|'\s/ and replace with a space.
You can also go with /(['"])([A-Za-z]+)\1/ and replace with \2 i.e second captured group.
Here's a script to demo an answer:
x = "Joe can't tell between 'large' and large."
puts x.gsub(/'\s|\s'/, " ")
# Output: Joe can't tell between large and large.
To decode what this script does - the gsub / regex line is saying:
Find all (an apostrophe followed by a space, '\s) or (a space followed by an apostrophe, \s') and replace each with a space.
This leaves apostrophes that aren't adjacent to spaces intact, which seems to remove only the apostrophes the OP is trying to remove.
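A short sketch of that behaviour, including one edge case it does not cover (the method name strip_spaced_quotes and the second sentence are made up for illustration):

```ruby
# Sketch: remove quotes that touch a space; apostrophes inside words are kept.
def strip_spaced_quotes(text)
  text.gsub(/'\s|\s'/, " ")
end

puts strip_spaced_quotes("Joe can't tell between 'large' and large.")
# Joe can't tell between large and large.

# Edge case: a closing quote followed by punctuation is not next to a space.
puts strip_spaced_quotes("Joe said 'large'.")
# Joe said large'.
```

So this pattern seems fine for quoted words surrounded by spaces, but it may leave a stray quote before punctuation or at the start or end of the string.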
Maybe this one?
entry = test[0].gsub(/'/, "")
But note that it removes every ', including the apostrophe in can't.
This does exactly what you are looking for, and it also leaves the Students' example from the comments untouched.
entry = test[0].gsub(/'([^\s]+)'/, '\1')
I don't have Ruby set up, but I confirmed this works here: http://tryruby.org/levels/1/challenges/0
Here is an example on regex101:
https://regex101.com/r/aY8aJ3/1
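For reference, a self-contained sketch of the capture-group approach, assuming the quoted text never contains spaces (the method name unquote_words and the second test sentence are hypothetical):

```ruby
# Sketch: unwrap '...' only when the quotes enclose a run of non-space characters.
def unquote_words(text)
  # \1 in the replacement is the text captured between the two quotes.
  text.gsub(/'(\S+)'/, '\1')
end

puts unquote_words("Joe can't tell between 'large' and large.")
# Joe can't tell between large and large.

puts unquote_words("The students' books are 'large'.")
# The students' books are large.
```

The apostrophes in can't and students' survive because no closing quote follows them before the next space. A quoted phrase with spaces in it, such as 'very large', would not be unwrapped by this pattern.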

Drop everything from a string before a specific word?

How can I remove everything in a string before a specific word (or including the first space and back)?
I have a string like this:
12345 Delivered to: Joe Schmoe
I only want Delivered to: Joe Schmoe
So, basically anything from the first space and back I don't want.
I'm running Ruby 1.9.3.
Use a regex to select just the part of the string you want.
"12345 Delivered to: Joe Schmoe"[/Delive.*/]
# => "Delivered to: Joe Schmoe"
Quite a few different ways are possible. Here are a couple:
s = '12345 Delivered to: Joe Schmoe'
s.split(' ')[1..-1].join(' ') # split on spaces, take the last parts, join on space
# or
s[s.index(' ')+1..-1] # Find the index of the first space and just take the rest
# or
/.*?\s(.*)/.match(s)[1] # Use a reg ex to pick out the bits after the first space
If Delivered isn't always the second word, you can do it this way:
s_line = "12345 Delivered to: Joe Schmoe"
puts s_line[/\s.*/].strip #=> "Delivered to: Joe Schmoe"
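Two more variants, sketched on the assumption that the unwanted prefix is always the first whitespace-delimited token (the method name drop_first_word is hypothetical):

```ruby
# Sketch: delete the leading non-space token and the whitespace after it.
def drop_first_word(s)
  s.sub(/\A\S+\s+/, "")
end

s = "12345 Delivered to: Joe Schmoe"
puts drop_first_word(s)
# Delivered to: Joe Schmoe

# Same idea without a regex: split into at most two pieces, keep the second.
puts s.split(" ", 2).last
# Delivered to: Joe Schmoe
```

If the string contains no whitespace, both variants return it unchanged rather than raising an error.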

How to extract first and last name from full name

I have a regex that, given the full name, is supposed to capture the first and last name. It should exclude the suffix, like "Jr.":
(.+)\s(.+(?!\sJr\.))
But this regex applied against the string Larry Farry Barry Jones Jr. gives the match:
1. Larry Farry Barry Jones
2. Jr.
Why is my negative lookahead failing to ignore the "Jr." when parsing the full name? I want match #2 to contain "Jones".
Rather than trying to do it with a single regex, I think the following would be more maintainable code.
full_name = "Larry Farry Barry Jones Jr."
name_parts = full_name.split - ["Jr."]
first_name, last_name = name_parts[0], name_parts[-1]
As a comment mentions, it is the first .+ that matches most of the string. The use of a lookahead seems incorrect here, as you do not want to return that value and do not need it to be included in a further match.
The following will split all words up but not return the 'Jr.' So you could take the first and last result.
(\w+\s)+?(?!\sJr\.)
I recommend Rubular for practicing Ruby RegExp.
The reason is that your .+ matches the string all the way to the end, and only then is the lookahead checked. At that point no "Jr." follows (because we are already at the end of the string), so the lookahead is satisfied and the match succeeds.
But that is because your pattern is wrong. Better would be this:
\S+(?:\s(?!Jr\.)\S+)*
Means:
\S+ match a series of at least one non whitespace character.
(?:\s(?!Jr\.)\S+)* Non capturing group: Match a whitespace and then, if it is not "Jr.", match the next series of non whitespace characters. This complete group can be repeated 0 or more times.
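A sketch of how that pattern could be used from Ruby to get the first and last name, assuming "Jr." is the only suffix to exclude (the constant and method names are hypothetical):

```ruby
# Pattern from above: words separated by single whitespace, stopping before "Jr."
NAME_WITHOUT_SUFFIX = /\S+(?:\s(?!Jr\.)\S+)*/

# Hypothetical helper: returns [first_name, last_name] with the suffix ignored.
def first_and_last(full_name)
  name = full_name[NAME_WITHOUT_SUFFIX]   # e.g. "Larry Farry Barry Jones"
  return nil if name.nil?                  # nothing matched (no non-space text)
  parts = name.split
  [parts.first, parts.last]
end

first_name, last_name = first_and_last("Larry Farry Barry Jones Jr.")
puts first_name  # Larry
puts last_name   # Jones
```

The regex only trims the suffix; picking the first and last words is left to split, which keeps the pattern itself short.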

Regular expression match that excludes characters inside parenthesis

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions have met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case, strings without the country were not being matched, and in another the regex returned the correct string along with the country abbreviation without the parentheses. (See the comments on the second response.) One point of clarification: all of the strings that I will be matching will be at the start of the string. Not sure if that helps or not. Thanks again for all your help.
Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis; I'm not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length < 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That'll greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.
The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
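Combining the two ideas above, here is a sketch that first strips the parenthesised country and then checks what is left against the constraints from the question. The helper name clean_name and the exact validity rule (2 to 18 uppercase letters or spaces, starting and ending with a letter) are assumptions drawn from the question, not a definitive reading of it:

```ruby
# Assumed rule: 2 to 18 characters, uppercase letters or spaces,
# beginning and ending with a letter.
VALID_NAME = /\A[A-Z][A-Z ]{0,16}[A-Z]\z/

# Hypothetical helper: drop a trailing "(...)" group, then validate the rest.
def clean_name(raw)
  name = raw.sub(/\s*\([^)]*\)\s*\z/, "").strip
  name if name =~ VALID_NAME   # nil when the name breaks the rule
end

["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"].each do |raw|
  puts "#{raw} => #{clean_name(raw)}"
end
# BILL SMITH (USA) => BILL SMITH
# WINTHROP (FR) => WINTHROP
# LORD AT WAR (GB) => LORD AT WAR
# KIM SMITH => KIM SMITH
```

Splitting the work into a removal step and a validation step handles names with and without a country in the same way, which was the sticking point in the single-regex attempts.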

Ruby regex: remove first name, leave last name

I am parsing a text and I want to ignore people's first names.
Examples (cases):
B.Obama => Obama
B. Obama => Obama
B . Obama => Obama
I manage to write this working Ruby regex:
"B.Obama".gsub(/\p{L}+\.(\p{L}+)/, '\\1')
However, it solves only one case. Also, it doesn't check whether the first letter is a capital.
So, what should a regex that covers all these cases look like?
Details: Ruby 1.92 and UTF-8 strings.
I gave it a little more thought and I like this better:
/^(\w+)[ .,](.+$)/
This will capture the first name and the last name in separate capturing groups, for example:
"Mark del cato".scan /^(\w+)[ .,](.+$)/
See Rubular for an example.
Or try:
^[^ .]+
This will pick up the first word on a line, i.e. the leading run of characters up to the first dot or space.
Hope it helps; see the example at Rubular.
Try
(\w+)$
\w+ matches one or more 'word' characters.
The $ is a zero-length match matching the end of the string.
Do you want to be able to pull second names from a piece of text? That could get very difficult. Can you post an excerpt of your text?
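For the three cases listed in the question, one possible sketch is shown below. It assumes the first name is always a single uppercase initial followed by a dot, with optional spaces around the dot, and that the last name starts with an uppercase letter; the constant and method names are hypothetical:

```ruby
# Assumption: a lone uppercase initial, optional spaces, a dot, optional spaces,
# then a capitalised last name. (?<!\p{L}) keeps the initial from being the
# tail end of a longer word.
INITIAL_AND_NAME = /(?<!\p{L})\p{Lu}\s*\.\s*(\p{Lu}\p{L}+)/

def drop_initial(text)
  # Keep only the captured last name.
  text.gsub(INITIAL_AND_NAME, '\1')
end

["B.Obama", "B. Obama", "B . Obama"].each { |s| puts drop_initial(s) }
# Each line prints: Obama
```

In running text this would also fire on a sentence that ends with a single capital letter before a capitalised word, so it is probably best applied only where names are expected.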
