Extract a word from a sentence in Ruby - ruby

I have the following string:
str = "XXX host:1233455 YYY ZZZ!"
I want to extract the value after host: from this string.
Is there any optimal way in Ruby to do this using RegExp, avoiding multiple loops?
Any solution is welcome.

If you have numbers, use the following regex:
(?<=host:)\d+
The lookbehind will find the numbers right after host:.
See IDEONE demo:
str = "XXX host:1233455 YYY ZZZ!"
puts str.match(/(?<=host:)\d+/)
Note that if you want to match alphanumerics and not any punctuation, you can replace \d+ with \w+.
Also, if the value can contain dots or commas, you can use
/(?<=host:)\d+(?:[.,]\d+)*/
It will extract values like 4,445 or 44.45.455.
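As a rough sketch of how that extended pattern behaves (the sample strings below are made up for illustration and do not come from the question):

```ruby
# Lookbehind pattern from above: digits, optionally followed by
# groups of "." or "," plus more digits.
pattern = /(?<=host:)\d+(?:[.,]\d+)*/

# Hypothetical inputs, invented for this example.
samples = ["a host:4,445 b", "a host:44.45.455 b", "a host:123, b"]

# String#[] with a regex returns the matched text (or nil).
results = samples.map { |s| s[pattern] }
p results
```

Note that a trailing separator with no digit after it, as in the last sample, is left out of the match.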
UPDATE:
In case you need a more universal solution (especially if you need to use the regex on another platform where lookbehind is not supported, such as older JavaScript engines), use the capture group approach:
str.match(/\bhost:(\d+)/).captures.first
Note that \b makes sure we find host: as a whole word, not localhost:. (\d+) is the capture group whose value we can refer to with the backreferences, or via .captures.first in Ruby.
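One thing worth keeping in mind: calling .captures on the result of match raises NoMethodError when nothing matches, because match returns nil. A minimal sketch of two ways to guard against that, assuming Ruby 2.3 or later for the &. operator (the second input string is invented for the example):

```ruby
str = "XXX host:1233455 YYY ZZZ!"
pattern = /\bhost:(\d+)/

# String#[] with a capture index returns nil instead of raising
# when the pattern does not match.
value = str[pattern, 1]

# Safe navigation (&.) guards the MatchData-based form the same way.
# "localhost:" does not match because \b requires a word boundary before "host".
missing = "XXX localhost:42 YYY".match(pattern)&.captures&.first

p value
p missing
```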

str[/host:(\d+)/, 1]
# => "1233455"

What about the regex:
host:(\S+)
Here you can find a demo.

You can capture the value for example.
str.match(/host:(\d+)/).captures.first

Related

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214"
Non-capturing groups still consume the text they match.
Use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or (Ruby requires a fixed-width lookbehind, so a single \d is used there)
"5,214".gsub(/(?<=\d)(,)(?=\d)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
match = Regexp.last_match
"#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
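As an illustration of the "more involved use-cases" remark, here is a sketch of the block form with named groups. The group names and the input string are assumptions made up for this example:

```ruby
# Hypothetical input with two comma-grouped numbers.
price = "1,234 and 5,678"

# Named groups (thousands, rest) are invented for readability.
result = price.gsub(/(?<thousands>\d),(?<rest>\d{3})/) do
  m = Regexp.last_match          # full MatchData for the current match
  "#{m[:thousands]}.#{m[:rest]}" # rebuild the match with a dot
end

p result
```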
gsub replaces the entire match that the regular expression engine produces; neither capturing nor non-capturing groups change that. However, you can use lookaround assertions, which do not "consume" any characters of the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included. We then match the comma, and the positive lookahead asserts that a digit follows.
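To compare the approaches discussed so far side by side, here is a small sketch (the variable names are only for illustration). All three should give the same string for this input:

```ruby
input = "5,214"

# Lookbehind + lookahead: only the comma is part of the match.
with_lookaround = input.gsub(/(?<=\d),(?=\d)/, ".")

# \K drops the leading digit from the reported match.
with_keep = input.gsub(/\d\K,(?=\d)/, ".")

# Backreferences: the digits are matched, then written back.
with_backrefs = input.gsub(/(\d),(\d)/, '\1.\2')

p [with_lookaround, with_keep, with_backrefs]
```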
I don't know Ruby.
But from what I see in the tutorial,
gsub means replace.
The pattern should be /(?<=\d),(?=\d)/, which replaces just the comma with a dot (a single \d, because Ruby's lookbehind must be fixed-width),
or use captures: /(\d+),(\d+)/ with the replacement string '\1.\2'.
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will be replaced by the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)
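Since String#tr maps characters by position and translates each character only once, the placeholder step can probably be skipped. A minimal sketch of the same swap in a single call:

```ruby
amount = '1,200.00'

# "." maps to "," and "," maps to "." in one pass, so the two
# replacements cannot interfere with each other.
swapped = amount.tr('.,', ',.')

p swapped
```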

ruby remove variable length string from regular expression leaving hyphen

I have a string such as this: "im# -33.870816,151.203654"
I want to extract the two numbers including the hyphen.
I tried this:
mystring = "im# -33.870816,151.203654"
/\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)
This gives me:
33.870816,151.203654
How do I get the hyphen?
I need to do this in ruby
Edit: I should clarify, the "im# " was just an example, there can be any set of characters before the numbers. the numbers are mostly well formed with the comma. I was having trouble with the hyphen (-)
Edit2: Note that the two numbers are latitude and longitude. That pattern is mostly fixed. However, in theory, the preceding string can be arbitrary. I don't expect it to contain numbers or a hyphen, but you never know.
How about this?
arr = "im# -33.2222,151.200".split(/[, ]/)[1..-1]
arr is now ["-33.2222", "151.200"] (using the split method), so
arr[0].to_f is -33.2222 and arr[1].to_f is 151.2
EDIT: stripped "im#" part with [1..-1] as suggested in comments.
EDIT2: also, this works regardless of what the first characters are, provided the prefix contains no space or comma and is followed by a single space.
If you want to capture the two numbers with the hyphen you can use this regex:
> str = "im# -33.870816,151.203654"
> str.match(/([\d.,-]+)/).captures
=> ["-33.870816,151.203654"]
Edit: now it captures the hyphen.
This one captures each number separately: http://rubular.com/r/NNP2OTEdiL
Note: String#scan matches all occurrences of the given pattern, in this case
> str.scan(/(-?\d+\.\d+)/)
=> [["-33.870816"], ["151.203654"]] # Good, but flattened version is better
> str.scan(/(-?\d+\.\d+)/).flatten
=> ["-33.870816", "151.203654"]
I recommend playing around a little with Rubular. There are also some docs about regular expressions in Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ
http://www.regular-expressions.info/ruby.html
http://www.ruby-doc.org/core-1.9.3/Regexp.html
Your regex doesn't work because the hyphen is consumed by the greedy \D*, so you have to modify that part to match only the characters you really want to skip.
[^0-9-]* would be a good option.
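A short sketch of that suggestion applied to the string from the question. The variable names lat and lng are assumptions based on the asker's note that the numbers are latitude and longitude:

```ruby
mystring = "im# -33.870816,151.203654"

# Original pattern: the greedy \D* swallows the hyphen.
original = /\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)

# Suggested fix: skip only characters that are neither digits nor hyphens.
fixed = /[^0-9-]*(-?\d+\.\d+),(-?\d+\.\d+)/.match(mystring)

# Convert the captured strings to floats.
lat, lng = fixed.captures.map(&:to_f)

p original[1]
p fixed.captures
```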

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right, but it appears as though the outer parens cause a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regular expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
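The difference here is that you create/grab the group using a look-behind. In Ruby, which requires a fixed-width look-behind, the optional quantifier has to go: /(?<=: )(.+)/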
If you still prefer the look-ahead concept to the look-behind one:
/(?=: ?(.+))/
This wraps your existing regex in a look-ahead and captures the data in group 1.
And yes, the outer parentheses in your code create a capture group. Compare that with the last example I gave, where the capture sits inside the look-ahead: wrapping the whole pattern in /( ... )/ is needless, since the first result in most regular expression engines is the entire matched string anyway.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
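One caveat with the split approach: if the value itself contains ": ", taking the last element drops part of it. A sketch of a possible guard using the limit argument of String#split (the sample line is invented for the example):

```ruby
# Hypothetical line whose value contains the separator.
line = "Title: Ruby: a sample title"

# Splits on every ": " and keeps only the last piece.
naive = line.split(": ")[-1]

# A limit of 2 splits at the first ": " only, so the value stays whole.
safe = line.split(": ", 2).last

p naive
p safe
```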
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
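Putting the \K pattern together with String#scan, a minimal sketch for collecting all values into an array could look like this. The heredoc reproduces the question's sample data without the leading bars, which is an assumption about the real input:

```ruby
text = <<~DATA
  Example Data
  Title: This is a sample title
  Content: This is sample content
  Date: 12/21/2012
DATA

# Everything before \K (the colon and any blanks) is dropped from each match.
values = text.scan(/:[[:blank:]]*\K.+/)

p values
```

The first line has no colon, so it contributes nothing to the result.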

Regex to match all alphanumeric hashtags, no symbols

I am writing a hashtag scraper for facebook, and every regex I come across to get hashtags seems to include punctuation as well as alphanumeric characters. Here's an example of what I would like:
Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression.
I would like it to match world, m4king, fac and expression (note that I would like it to cut off if it reaches punctuation, including spaces). It would be nice if it didn't include the hash symbol, but it's not super important.
Just in case it's important, I will be using Ruby's String#scan method to grab possibly more than one tag.
Thanks heaps in advance!
A regex such as this: #([A-Za-z0-9]+) should match what you need and place it in a capture group. You can then access this group later. Maybe this will help shed some light on regular expressions (from a Ruby context).
The regex above will start matching when it finds a # tag and will throw any following letters or numbers into a capture group. Once it finds anything which is not a letter or a digit, it will stop the matching. In the end you will end up with a group containing what you are after.
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#([A-Za-z0-9]+)/).flatten #=> ["world", "m4king", "fac", "expression"]
The call to #flatten is needed because, when the pattern contains a capture group, scan returns one array of captures per match.
Alternatively, you can use look-behind matching which will match alphanumeric characters only after a '#':
str.scan /(?<=#)[[:alnum:]]+/ #=> ["world", "m4king", "fac", "expression"]
Here's a simpler regex: /#[[:alnum:]_]+/. Note that it includes underscores because Facebook currently includes underscores as part of hashtags (as does Twitter).
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#[[:alnum:]_]+/)
Here's a view on Rubular:
http://rubular.com/r/XPPqwtVGN9
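To make the difference between the two character classes concrete, here is a small sketch using the sample sentence from the question. The capture group around the second pattern is an addition so that the hash symbol is dropped in both cases:

```ruby
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'

# Letters and digits only: the tag is cut off at the underscore.
alnum_only = str.scan(/#([A-Za-z0-9]+)/).flatten

# Letters, digits and underscores: the underscore stays in the tag.
with_underscore = str.scan(/#([[:alnum:]_]+)/).flatten

p alnum_only
p with_underscore
```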

Ruby regular expression for this string?

I'm trying to get the first word in this string: Basic (11/17/2011 - 12/17/2011)
So ultimately wanting to get Basic out of that.
Other example string: Premium (11/22/2011 - 12/22/2011)
The format is always "Single-word followed by parenthesized date range" and I just want the single word.
Use this:
str = "Premium (11/22/2011 - 12/22/2011)"
str.split.first # => "Premium"
split uses whitespace as the default separator if you don't specify one.
After that, get the first element with first.
You don't need regexp for that, you can just use
str.split(' ')[0]
I know you found the answer you are needing but in case anyone stumbles on this in the future, in order to pull the needed value out of a large String of unknown length:
word_you_need = s.slice(/(\b[a-zA-Z]*\b \(\d+\/\d+\/\d+ - \d+\/\d+\/\d+\))/).split[0]
This regular expression will match the first word without the trailing space
"^\w+ ??"
If you really want a regex you can get the first group after using this regex:
(\w*) .*
"Single-word followed by parenthesized date range"
'word' and 'parenthesized date range' should be better defined
because, by your requirement statement, they act as anchors and/or delimiters.
These raw regexes are just a general guess.
\w+(?=\s*\([^)]*\))
or
\w+(?=\s*\(\s*\d+(?:/\d+)*\s*-\s*\d+(?:/\d+)*\s*\))
Actually, all you need is:
s.split[0]
...or...
s.split.first
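For comparison, a sketch that runs both approaches over the two example strings from the question. The anchored regex here is one possible variation on the patterns above, not the only way to write it:

```ruby
plans = [
  "Basic (11/17/2011 - 12/17/2011)",
  "Premium (11/22/2011 - 12/22/2011)"
]

# Plain split: first whitespace-separated token.
by_split = plans.map { |s| s.split.first }

# Regex: a word at the start of the string, followed by an opening parenthesis.
by_regex = plans.map { |s| s[/\A\w+(?=\s*\()/] }

p by_split
p by_regex
```

The regex form returns nil for strings that do not have the expected shape, which may or may not be what you want.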
