How can I remove everything in a string before a specific word (that is, everything up to and including the first space)?
I have a string like this:
12345 Delivered to: Joe Schmoe
I only want Delivered to: Joe Schmoe
So, basically, I don't want anything up to and including the first space.
I'm running Ruby 1.9.3.
Use a regex to select just the part of the string you want.
"12345 Delivered to: Joe Schmoe"[/Delive.*/]
# => "Delivered to: Joe Schmoe"
Quite a few different ways are possible. Here are a couple:
s = '12345 Delivered to: Joe Schmoe'
s.split(' ')[1..-1].join(' ') # split on spaces, drop the first word, rejoin with single spaces
# or
s[s.index(' ')+1..-1] # Find the index of the first space and just take the rest
# or
/.*?\s(.*)/.match(s)[1] # Use a reg ex to pick out the bits after the first space
If Delivered isn't always the 2nd word, you can do it this way:
s_line = "12345 Delivered to: Joe Schmoe"
puts s_line[/\s.*/].strip #=> "Delivered to: Joe Schmoe"
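The approaches above can be compared side by side. The following is only a sketch: the helper name drop_first_word is made up for this example, and it assumes the unwanted prefix is always the first whitespace-delimited word.

```ruby
# Hypothetical helper (not from the answers above): drops the first
# whitespace-delimited word and the whitespace that follows it.
def drop_first_word(str)
  # \A anchors at the start of the string, \S+ is the first word,
  # \s+ is the separator after it. If nothing matches, str is returned as is.
  str.sub(/\A\S+\s+/, '')
end

s = '12345 Delivered to: Joe Schmoe'
puts drop_first_word(s)
puts s.split(' ', 2).last   # split into at most two pieces, keep the second
puts s.partition(' ').last  # everything after the first space
```

The variants differ when the string contains no space at all: the sub version and split(' ', 2).last return the string unchanged, while partition(' ').last returns an empty string.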
I have the following Ruby Regex that selects punctuation and excludes periods that are part of numbers:
/\p{L}+|(?!\.\d)[[:punct:]]/
The profit was 5.2 thousand dollars.
=> The profit was thousand dollars.
I have a regex that can select abbreviations (U.S.A) for example:
(?:[a-zA-Z]\.){2,}
The U.S.A. is located in North America.
=> U.S.A.
I would like to use the ideas behind these regexes so that I can select all of the words and punctuation in a sentence except for any periods in any abbreviation as:
The U.S.A. is located in North America!
=> The USA is located in North America!
Any ideas on how to accomplish this?
I think it should be done in 2 steps because you cannot match discontinuous parts of text with one matching iteration.
Use
s = 'The U.S.A. is located in North America!'
s = s.gsub(/\b(?:\p{L}\.){2,}/) { $~[0].gsub(".", "") }
puts s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
See the Ruby demo
The first step is to run a gsub with the \b(?:\p{L}\.){2,} pattern (I added a word boundary to make sure the pattern only matches single-letter chunks). Within the block, the dots are stripped from the match value using a literal text replacement.
The second step is running your first regex within a scan to collect the chunks you need.
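The two steps can be wrapped in one method. This is a minimal sketch; the method name tokens_without_abbreviation_dots is invented here, and the patterns are the ones quoted above.

```ruby
# Hypothetical wrapper around the two steps described above.
def tokens_without_abbreviation_dots(text)
  # Step 1: collapse dotted abbreviations such as "U.S.A." into "USA".
  collapsed = text.gsub(/\b(?:\p{L}\.){2,}/) { |abbr| abbr.delete('.') }
  # Step 2: collect words and punctuation, skipping dots inside numbers.
  collapsed.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
end

p tokens_without_abbreviation_dots('The U.S.A. is located in North America!')
```

Joining the returned tokens back into a sentence is left out, because how to space punctuation depends on the use case.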
str = "The U.S.A. have 50.1415 states approx and are located in North America!"
str.gsub(/(?<!\p{L}\p{L})\P{L}*\.[^\p{L}\s]*/, '').squeeze(' ')
#⇒ "The USA have states approx and are located in North America!"
I think using a regex alone will be difficult, though I'll be glad to be corrected with a working solution.
My solution:
Parse out the parts you don't want (the abbreviations) using your second regex first, and then apply the first regex (which selects words and punctuation) to what remains. This effectively hides the abbreviations from your first regex.
I have a similar requirement for a project. The key thing is to use the partition method, loop through the regexes (2 in your case) and make sure you don't apply a later regex to text that was already "captured" by an earlier regex in the loop.
You may use this class from GitHub: SourceParser and use it like this:
parser = SourceParser.new
parser.regexter('abbrs', /(?:[a-zA-Z]\.){2,}/) # return matched as is
parser.regexter(
  'first regex',
  /\p{L}+|(?!\.\d)[[:punct:]]/,
  lambda do |token, regexp|
    "(#{token})"
  end
)
parser.parse("The U.S.A. is located in North America")
# => (The) U.S.A. (is) (located) (in) (North) (America)
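Without the SourceParser class, the partition idea from this answer might look roughly like the sketch below. The method name protect_and_scan and the constants ABBR and TOKEN are assumptions made for this example; only String#partition and String#scan come from Ruby itself.

```ruby
ABBR  = /(?:[a-zA-Z]\.){2,}/            # abbreviations to keep intact
TOKEN = /\p{L}+|(?!\.\d)[[:punct:]]/    # words and punctuation

# Hypothetical helper: text matched by `protect` is passed through as is,
# and only the text between protected matches is scanned with `token`.
def protect_and_scan(text, protect, token)
  result = []
  rest = text
  until rest.empty?
    # partition returns [before, match, after]; when nothing matches,
    # match and after are both empty, which ends the loop.
    before, matched, rest = rest.partition(protect)
    result.concat(before.scan(token))
    result << matched unless matched.empty?
  end
  result
end

p protect_and_scan('The U.S.A. is located in North America', ABBR, TOKEN)
```

This relies on the protecting regex never matching an empty string; a pattern that can match nothing would need an extra guard.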
I have a file with lines that vary in their format, but the basic idea is like this:
- A block of text #tag #due(2014-04-20) #done(2014-04-22)
For example:
- Email John Doe #email #due(2014-04-20) #done(2014-04-22)
The issue is the #tag and the #due date do not appear in every entry, so some are just like:
- Email John Doe #done(2014-04-22)
I'm trying to write a Ruby Regex that finds the item between the "- " and the first occurrence of EITHER a hashtag or a #done/#due tag.
I have been trying to use groups and look ahead, but I can't seem to get it right when there are multiple instances of what I am looking ahead for. Using my second example string, this Regex:
/-\s(.*)(?=[#|#])/
Yields this result for (.*):
Email John Doe #email #due(2014-04-20)
Is there any way I can get this right? Thanks!
You're missing the ? after .* that makes the match non-greedy. I would also remove the | from inside your character class: within [...] it is not alternation, it is matched as a literal | character.
/-\s(.*?)(?=[##])/
See Demo
You really don't need a Positive Lookahead here either, just match up until those characters and print the result from your capturing group.
/-\s(.*?)[##]/
You could also use negation in this case.
/-\s([^##]*)/
This should do it:
str = "- Email John Doe #email #due(2014-04-20) #done(2014-04-22)"
str[/-(.*?)(?:#|#due|#done)/, 1]
#=> " Email John Doe "
(.*?) is a capture group, with ? making .* non-greedy. The result of the capture is retrieved by the ,1 at the end.
Credit to @hwnd for noticing the need to make .* non-greedy shortly before I posted, though I did not see the comment until later.
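To cover entries with several tags, one tag, or none, the non-greedy capture can be combined with a lookahead that also accepts the end of the line. This is a sketch under the assumption that every entry starts with "- " and that tags start with "#"; the method name item_text is made up for this example.

```ruby
# Hypothetical helper: returns the text between the leading "- " and the
# first "#" tag, or the rest of the line when there are no tags.
# Returns nil when the line does not start with "- ".
def item_text(line)
  # (.*?) grows one character at a time until optional whitespace is
  # followed by "#" or by the end of the string (\z).
  line[/\A-\s+(.*?)\s*(?=#|\z)/, 1]
end

[
  '- Email John Doe #email #due(2014-04-20) #done(2014-04-22)',
  '- Email John Doe #done(2014-04-22)',
  '- Email John Doe'
].each { |line| puts item_text(line) }
```

Unlike the capture in the answers above, this version also trims the whitespace before the first tag.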
I am new to Ruby and regular expressions and am trying to figure out how to approach separating the attached string of baseball players into first/last name combinations.
This is a sample string:
"JohnnyCuetoJ.J.PutzBrianMcCann"
This is the desired output:
Johnny Cueto
J.J. Putz
Brian McCann
I have figured out how to separate by capital letters which gets me close, but the outlier names like J.J. and McCann mess that pattern up. Anyone have ideas on the best way to approach this?
If you don't have to do it in one single gsub then it gets a bit easier.
string = "JohnnyCuetoJ.J.PutzBrianMcCann"
string.gsub!(/([A-Z][^A-Z]+)/, '\1 ') # separate by capital letters
string.gsub!(/(\.) ([A-Z]\.)/, '\1\2') # paste together "J. J." -> "J.J."
string.gsub!(/Mc /, 'Mc') # Remove the space in "Mc "
string.strip # Remove the extra space after "Cann "
...and of course you can put this on a single line by chaining the gsub calls, but that will basically kill the readability of the code (but on the other hand, how readable is a block of regexen anyway?)
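For reference, here is a chained version that also produces one name per entry. It is a sketch only: the method name split_players is invented, and pairing tokens two at a time assumes every player has exactly one first-name token and one last-name token, which will not hold for all real names.

```ruby
# Hypothetical helper built from the gsub steps above.
def split_players(run_together)
  spaced = run_together
    .gsub(/([A-Z][^A-Z]+)/, '\1 ')    # add a space after each capitalized chunk
    .gsub(/(\.) ([A-Z]\.)/, '\1\2')   # rejoin initials: "J. J." -> "J.J."
    .gsub(/Mc /, 'Mc')                # rejoin "Mc Cann" -> "McCann"
    .strip
  # Assumption: names come in first/last pairs.
  spaced.split(' ').each_slice(2).map { |pair| pair.join(' ') }
end

puts split_players('JohnnyCuetoJ.J.PutzBrianMcCann')
```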
I have a string pattern that, as an example, looks like this:
WBA - Skinny Joe vs. Hefty Hal
I want to truncate the pattern "WBA - " from the string and return just "Skinny Joe vs. Hefty Hal".
Assuming that the "WBA" spot will be a sequence of any letter or number, followed by a space, dash, and space:
str = "WBA - Skinny Joe vs. Hefty Hal"
str.sub(/^\w+\s-\s/, '')
By the way — RegexPal is a great tool for testing regular expressions like these.
If you need a more complex string replacement, you can look into writing a more sophisticated regular expression. Otherwise:
Keep it simple! If you only need to remove "WBA - " from the beginning of the string, use String#sub.
s = "WBA - Skinny Joe vs. Hefty Hal"
puts s.sub(/^WBA - /, '')
# => Skinny Joe vs. Hefty Hal
You can also remove the first occurrence of a pattern with the following snippet:
s[/^WBA - /] = ''
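The two styles behave differently when the prefix is missing, which matters if some titles have no label. The sketch below illustrates this; the method name strip_label is made up, and \A is used instead of ^ so that only the very start of the string can match.

```ruby
# Hypothetical helper: removes a leading "<word> - " label if present.
def strip_label(title)
  title.sub(/\A\w+\s-\s/, '')
end

puts strip_label('WBA - Skinny Joe vs. Hefty Hal')
puts strip_label('Skinny Joe vs. Hefty Hal') # no label, returned unchanged

# The s[regex] = '' form mutates the string in place and raises IndexError
# when the pattern does not match, so it needs a guard for optional prefixes.
s = String.new('Skinny Joe vs. Hefty Hal')
begin
  s[/\AWBA - /] = ''
rescue IndexError
  puts 'no WBA prefix to remove'
end
```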
I am parsing a text and I want to ignore people's first names.
Examples (cases):
B.Obama => Obama
B. Obama => Obama
B . Obama => Obama
I managed to write this working Ruby regex:
"B.Obama".gsub(/\p{L}+\.(\p{L}+)/, '\\1')
However, it handles only one case. Also, it doesn't check whether the first letter is a capital.
So, what should a regex that combines all these cases look like?
Details: Ruby 1.9.2 and UTF-8 strings.
I gave it a little more thought and I like this better:
/^(\w+)[ .,](.+$)/
This will capture both the first name and last name in different capturing groups
i.e.
"Mark del cato".scan /^(\w+)[ .,](.+$)/
See Rubular for an example.
Or try
^[^ .]+
This will pick up the first word on a line, that is, everything up to the first dot or space.
Hope it helps, see example at Rubular
Try
(\w+)$
\w+ matches one or more 'word' characters.
The $ is a zero-length match matching the end of the string.
Do you want to be able to pull second names from a piece of text? That could get very difficult. Can you post an excerpt of your text?
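For the three cases listed in the question, one possible pattern is a single uppercase letter, optional spaces, a dot, and optional spaces, followed by another uppercase letter. This is a sketch, not a general name parser; the constant INITIAL and the method drop_initials are names invented for this example.

```ruby
# Hypothetical pattern: an uppercase initial and its dot (with optional
# spaces around the dot), only when an uppercase letter follows.
INITIAL = /\b\p{Lu}\s*\.\s*(?=\p{Lu})/

def drop_initials(text)
  text.gsub(INITIAL, '')
end

['B.Obama', 'B. Obama', 'B . Obama'].each do |name|
  puts drop_initials(name)
end
```

Because \p{Lu} is a Unicode property, non-ASCII capitals are covered in UTF-8 strings. The pattern would also eat letters from dotted abbreviations such as U.S.A., so text containing those needs separate handling.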