How to extract a portion of a line in Ruby?

I have a line say
line = "start running at Sat April 1 07:30:37 2017"
and I want to extract
"Sat April 1 07:30:37 2017"
I tried this...
line = "start running at Sat April 1 07:30:37 2017"
if (line =~ /start running at/)
line.split("start running at ").last
end
... but is there any other way of doing this?

This is a way to extract, from an arbitrary string, a substring that represents a time in the given format. I've assumed there is at most one such substring in the string.
require 'time'
R = /
(?:#{Date::ABBR_DAYNAMES.join('|')})\s
# match day name abbreviation in non-capture group, space
(?:#{Date::MONTHNAMES[1,12].join('|')})\s
# match month name in non-capture group, space
\d{1,2}\s # match one or two digits, space
\d{2}: # match two digits, colon
\d{2}: # match two digits, colon
\d{2}\s # match two digits, space
\d{4} # match 4 digits
(?!\d) # do not match digit (negative lookahead)
/x # free-spacing regex def mode
# /
# (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s
# (?:January|February|March|...|November|December)\s
# \d{1,2}\s
# \d{2}:
# \d{2}:
# \d{2}\s
# \d{4}
# (?!\d)
# /x
def extract_time(str)
s = str[R]
return nil if s.nil?
(DateTime.strptime(s, "%a %B %e %H:%M:%S %Y") rescue nil) ? s : nil
end
str = "start eating breakfast at Sat April 1 07:30:37 2017"
extract_time(str)
#=> "Sat April 1 07:30:37 2017"
str = "go back to sleep at Cat April 1 07:30:37 2017"
extract_time(str)
#=> nil
Alternatively, if there is a match against R but DateTime.strptime raises an exception (meaning s is not a valid time for the given time format), one could raise an exception to advise the user.
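A minimal sketch of that raising variant might look like the following. The names TIME_FORMAT, TIME_RE and extract_time! are made up for illustration, and the choice of ArgumentError as the exception to raise is an assumption, not something the answer prescribes.

```ruby
require 'date'

# Hypothetical names, for illustration only.
TIME_FORMAT = "%a %B %e %H:%M:%S %Y"
TIME_RE = /
  (?:#{Date::ABBR_DAYNAMES.join('|')})\s       # day name abbreviation, space
  (?:#{Date::MONTHNAMES[1, 12].join('|')})\s   # full month name, space
  \d{1,2}\s                                    # one or two digits, space
  \d{2}:\d{2}:\d{2}\s                          # hours, minutes, seconds, space
  \d{4}(?!\d)                                  # four digits, not followed by a digit
/x

# Returns the matched substring, or nil when nothing in str looks like a time.
# Raises ArgumentError when the substring matches the pattern but is not a valid date.
def extract_time!(str)
  s = str[TIME_RE]
  return nil if s.nil?
  DateTime.strptime(s, TIME_FORMAT) # raises ArgumentError (Date::Error on newer Rubies)
  s
rescue ArgumentError
  raise ArgumentError, "#{s.inspect} matches the pattern but is not a valid time"
end

extract_time!("start eating breakfast at Sat April 1 07:30:37 2017")
#=> "Sat April 1 07:30:37 2017"
```

A string such as "late at Sat April 31 07:30:37 2017" matches the regex but names a day that does not exist, so it takes the raising path instead of silently returning nil.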

try
line.sub(/start running at (.*)/, '\1')

The standard way to do this with regular expressions would be:
if md = line.match(/start running at (.*)/)
md[1]
end
But you don't need regular expressions, you can do regular string operations:
prefix = 'start running at '
if line.start_with?(prefix)
line[prefix.size..-1]
end
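If you are on Ruby 2.5 or newer, String#delete_prefix does the slicing for you. This is only a sketch; PREFIX and strip_prefix are names I made up, and the nil return for non-matching lines is my assumption about what you want.

```ruby
# Hypothetical names, for illustration only.
PREFIX = 'start running at '

# Returns the text after PREFIX, or nil when the line does not start with it.
def strip_prefix(line)
  return nil unless line.start_with?(PREFIX)
  line.delete_prefix(PREFIX) # requires Ruby 2.5+
end

strip_prefix("start running at Sat April 1 07:30:37 2017")
#=> "Sat April 1 07:30:37 2017"
strip_prefix("stop running at Sat April 1 07:30:37 2017")
#=> nil
```

Without the start_with? guard, delete_prefix simply returns a copy of the whole line when the prefix is absent, which may or may not be what you want.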

Here's another (as it turns out, slightly faster) option using #partition:
# returns an empty string if there is no match, whereas split(...).last would return the whole line unchanged
line.partition('start running at ').last
I was interested how this performs against regexp match, so here's a quick benchmark with 1 million executions each:
line.sub(/start running at (.*)/, '\1')
# => @real=1.7465
line.partition('start running at ').last
# => @real=0.712406
# => this is faster, but you'd need to be calling this quite a bit for it to make a significant difference
Bonus: it also makes it really easy to cater for a more general case e.g. if you have lines that start with "start running at" and others that start with "stop running at". Then something like line.partition(' at ').last will cater for both (and actually run slightly faster).
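To make the difference in the no-match case concrete, here is a small sketch comparing the two. The helper names via_split and via_partition are mine, purely for illustration.

```ruby
# Hypothetical helpers, for illustration only.
def via_split(line, prefix = 'start running at ')
  line.split(prefix).last
end

def via_partition(line, separator = 'start running at ')
  line.partition(separator).last
end

hit  = "start running at Sat April 1 07:30:37 2017"
miss = "stopped at noon"

via_split(hit)       #=> "Sat April 1 07:30:37 2017"
via_partition(hit)   #=> "Sat April 1 07:30:37 2017"
via_split(miss)      #=> "stopped at noon" (whole line comes back)
via_partition(miss)  #=> "" (empty string signals "no match")

# The more general separator from the bonus note:
via_partition("stop running at Sun April 2 08:00:00 2017", ' at ')
#=> "Sun April 2 08:00:00 2017"
```

Note that partition splits on the first occurrence of the separator, so ' at ' only works as long as it cannot appear earlier in the line.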

And yet another alternative:
puts $1 if line =~ /start running at (.*)/

The shortest would be line["Sat April 1 07:30:37 2017"] which would return your "Sat April 1 07:30:37 2017" string if present and nil if not.
The [] notation on a String is a shorthand for getting a substring out of the string and can be used with another string or a Regular Expression. See https://ruby-doc.org/core-2.2.0/String.html#method-i-5B-5D
In case the string is unknown you can use this shorthand also like Cary suggested
line[/start running at (.*)/, 1]
In case you want to be sure the date extracted is valid you would need the regular expression from his answer but you still could use this method.
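For reference, a quick sketch of the String#[] forms mentioned above. LINE and timestamp_of are illustrative names of my own, and the named group stamp is likewise just an example.

```ruby
# Hypothetical names, for illustration only.
LINE = "start running at Sat April 1 07:30:37 2017"

# Returns capture group 1, or nil when the line does not match.
def timestamp_of(line)
  line[/start running at (.*)/, 1]
end

timestamp_of(LINE)              #=> "Sat April 1 07:30:37 2017"
timestamp_of("something else")  #=> nil

# A plain string argument returns that substring if present, nil otherwise.
LINE["Sat April"]               #=> "Sat April"
LINE["Sun April"]               #=> nil

# The second argument may also be the name of a named capture.
LINE[/start running at (?<stamp>.*)/, "stamp"]
#=> "Sat April 1 07:30:37 2017"
```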

Related

Remove all special char except apostrophe

Given a sentence, I want to count all the duplicated words:
It is an exercise from Exercism.io: Word Count.
For example for the input "olly olly in come free"
olly: 2
in: 1
come: 1
free: 1
I have this test, for example:
def test_with_quotations
phrase = Phrase.new("Joe can't tell between 'large' and large.")
counts = {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
assert_equal counts, phrase.word_count
end
this is my method
def word_count
phrase = @phrase.downcase.split(/\W+/)
counts = phrase.group_by{|word| word}.map {|k,v| [k, v.count]}
Hash[*counts.flatten]
end
For the test above I have this failure when I run it in the terminal:
2) Failure:
PhraseTest#test_with_apostrophes [word_count_test.rb:69]:
--- expected
+++ actual
@@ -1 +1 @@
-{"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
+{"first"=>1, "don"=>2, "t"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
My problem is how to remove all characters except the apostrophe.
The regex in the method almost works:
phrase = @phrase.downcase.split(/\W+/)
but it removes the apostrophes as well.
I don't want to keep the single quotes around a word ('Hello' => Hello),
but Don't be cruel should stay Don't be cruel.
Maybe something like:
string.scan(/\b[\w']+\b/i).each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
The regex employs word boundaries (\b).
The scan outputs an array of the found words and for each word in the array they are added to the hash, which has a default value of zero for each item which is then incremented.
It turns out that my solution, while it finds all the words regardless of case, leaves each word in the case in which it was originally found.
It is now up to Nelly either to accept that as is, or to downcase the original string (or each array item as it is added to the hash).
I'll leave that decision up to you :)
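In case it helps, here is a sketch of the downcase-first variant, written as a free-standing method rather than the Phrase class from the question (that simplification is mine). Keep in mind that \w also matches digits and underscores.

```ruby
# Hypothetical free-standing helper; the question wraps this in a Phrase class.
def word_count(phrase)
  phrase.downcase
        .scan(/\b[\w']+\b/)                               # words, keeping inner apostrophes
        .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

word_count("Joe can't tell between 'large' and large.")
#=> {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
```

The surrounding quotes in 'large' are dropped because \b only matches between a word character and a non-word character, so a match can neither start nor end on an apostrophe.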
Given:
irb(main):015:0> phrase
=> "First: don't laugh. Then: don't cry."
Try:
irb(main):011:0> Hash[phrase.downcase.scan(/[a-z']+/)
.group_by{|word| word.downcase}
.map{|word, words|[word, words.size]}
]
=> {"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
With your update, if you want to remove single quotes, do that first:
irb(main):038:0> p2
=> "Joe can't tell between 'large' and large."
irb(main):039:0> p2.gsub(/(?<!\w)'|'(?!\w)/,'')
=> "Joe can't tell between large and large."
Then use the same method.
But you might say that gsub(/(?<!\w)'|'(?!\w)/,'') will also remove the apostrophe in 'Twas the night before. To which I reply: you will eventually need to build a parser that can distinguish an apostrophe from a single quote if /(?<!\w)'|'(?!\w)/ is not sufficient.
You can also use word boundaries:
irb(main):041:0> Hash[p2.downcase.scan(/\b[a-z']+\b/)
.group_by{|word| word.downcase}
.map{|word, words|[word, words.size]}
]
=> {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
But that does not solve 'Twas the night before either.
Another way:
str = "First: don't 'laugh'. Then: 'don't cry'."
reg = /
[a-z] #single letter
[a-z']+ #one or more letters or apostrophe
[a-z] #single letter
'? #optional single apostrophe
/ix #case-insensitive and free-spacing regex
str.scan(reg).group_by(&:itself).transform_values(&:count)
#=> {"First"=>1, "don't"=>2, "laugh'"=>1, "Then"=>1, "cry'"=>1}

Stuck in Abbreviation implementation to ruby string

I want to convert all the (alphabetic) words in a string to their abbreviations, like i18n does. In other words, I want to change "extraordinary" into "e11y" because there are 11 characters between the first and the last letter in "extraordinary". My code works when the string is a single word, but how can I do the same for a multi-word string? And of course, if a word has 4 or fewer characters, there is no point in abbreviating it.
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/, "#{x[0]}#{(x.length-2)}#{x[-1]}")
end
end
Test.assert_equals( Abbreviator.abbreviate("banana"), "b4a", Abbreviator.abbreviate("banana") )
Test.assert_equals( Abbreviator.abbreviate("double-barrel"), "d4e-b4l", Abbreviator.abbreviate("double-barrel") )
Test.assert_equals( Abbreviator.abbreviate("You, and I, should speak."), "You, and I, s4d s3k.", Abbreviator.abbreviate("You, and I, should speak.") )
Your mistake is that your second parameter is a substitution string operating on x (the original entire string) as a whole.
Instead of using the form of gsub where the second parameter is a substitution string, use the form of gsub where the second parameter is a block (listed, for example, third on this page). Now you are receiving each substring into your block and can operate on that substring individually.
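To see the difference between the two forms side by side, here is a small sketch. The method names abbreviate_whole and abbreviate_each are mine, and length handling is left out on purpose to keep the comparison short.

```ruby
# Hypothetical names, for illustration only.

# String form: the replacement is built once, from the whole input x,
# before gsub starts matching, so every word gets the same text.
def abbreviate_whole(x)
  x.gsub(/\w+/, "#{x[0]}#{x.length - 2}#{x[-1]}")
end

# Block form: the block runs once per match and receives that match.
def abbreviate_each(x)
  x.gsub(/\w+/) { |word| "#{word[0]}#{word.length - 2}#{word[-1]}" }
end

abbreviate_whole("double-barrel")  #=> "d11l-d11l"
abbreviate_each("double-barrel")   #=> "d4e-b4l"
```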
def short_form(str)
str.gsub(/[[:alpha:]]{4,}/) { |s| "%s%d%s" % [s[0], s.size-2, s[-1]] }
end
The regex reads, "match four or more alphabetic characters".
short_form "abc" # => "abc"
short_form "a-b-c" #=> "a-b-c"
short_form "cats" #=> "c2s"
short_form "two-ponies-c" #=> "two-p4s-c"
short_form "Humpty-Dumpty, who sat on a wall, fell over"
#=> "H4y-D4y, who sat on a w2l, f2l o2r"
I would recommend something along the lines of this:
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/) do |word|
# Skip the word unless it's long enough
next word unless word.length > 4
# Do the same I18n conversion you did before
"#{word[0]}#{(word.length-2)}#{word[-1]}"
end
end
end
The accepted answer isn't bad, but it can be made a lot simpler by not matching words that are too short in the first place:
def abbreviate(str)
str.gsub(/([[:alpha:]])([[:alpha:]]{3,})([[:alpha:]])/i) { "#{$1}#{$2.size}#{$3}" }
end
abbreviate("You, and I, should speak.")
# => "You, and I, s4d s3k."
Alternatively, we can use lookbehind and lookahead, which makes the Regexp more complex but the substitution simpler:
def abbreviate(str)
str.gsub(/(?<=[[:alpha:]])[[:alpha:]]{3,}(?=[[:alpha:]])/i, &:size)
end
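As a quick check of the lookaround version against the cases from the question, here is a sketch with the block spelled out (I call to_s explicitly instead of relying on &:size; the behaviour should be the same).

```ruby
# Same idea as the lookbehind/lookahead version above, with an explicit block.
def abbreviate(str)
  # Match only the middle of a word: at least 3 letters with a letter on each side.
  str.gsub(/(?<=[[:alpha:]])[[:alpha:]]{3,}(?=[[:alpha:]])/) { |middle| middle.size.to_s }
end

abbreviate("banana")                     #=> "b4a"
abbreviate("double-barrel")              #=> "d4e-b4l"
abbreviate("You, and I, should speak.")  #=> "You, and I, s4d s3k."
```

Because only the middle of each word is matched, the first and last letters are never touched, and words shorter than five letters cannot match at all.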

Regex to grab full firstname and first letter of last name

I have a list of users grabbed by the Etc Ruby library:
Thomas_J_Perkins
Jennifer_Scanner
Amanda_K_Loso
Aaron_Cole
Mark_L_Lamb
What I need to do is grab the full first name, skip the middle name (if given), and grab the first character of the last name. The output should look like this:
Thomas P
Jennifer S
Amanda L
Aaron C
Mark L
I'm not sure how to do this, I've tried grabbing all of the characters: /\w+/ but that will grab everything.
You don't always need regular expressions.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. Jamie Zawinski
You can do it with some simple Ruby code
string = "Mark_L_Lamb"
string.split('_').first + ' ' + string.split('_').last[0]
=> "Mark L"
I think it's simpler without regex:
array = "Thomas_J_Perkins".split("_") # split at _
array.first + " " + array.last[0] # .first prints first name .last[0] prints first char of last name
#=> "Thomas P"
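Applied to the whole list from the question, the split approach could be sketched like this. USERS and short_name are names I picked for illustration, and the code assumes every entry contains at least one underscore.

```ruby
# Hypothetical names, for illustration only.
USERS = %w[Thomas_J_Perkins Jennifer_Scanner Amanda_K_Loso Aaron_Cole Mark_L_Lamb]

# Assumes the name has at least two underscore-separated parts.
def short_name(user)
  first, *, last = user.split('_') # the splat swallows any middle names
  "#{first} #{last[0]}"
end

USERS.map { |user| short_name(user) }
#=> ["Thomas P", "Jennifer S", "Amanda L", "Aaron C", "Mark L"]
```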
You can use
^([^\W_]+)(?:_[^\W_]+)*_([^\W_])[^\W_]*$
And replace with \1 \2 (a space between the groups, to match the requested output). See the regex demo
The [^\W_] matches a letter or a digit. If you want to only match letters, replace [^\W_] with \p{L}.
^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$
See updated demo
The point is to match and capture the first chunk of letters up to the first _ (with (\p{L}+)), then match 0+ sequences of _ + letters inside (with (?:_\p{L}+)*_) and then match and capture the last word first letter (with (\p{L})) and then match the rest of the string (with \p{L}*).
NOTE: replace ^ with \A and $ with \z if you have independent strings (as in Ruby ^ matches the start of a line and $ matches the end of the line).
Ruby code:
s.sub(/^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$/, "\\1 \\2")
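To illustrate the note about anchors, here is a sketch with \A and \z, using a space in the replacement as in the question's expected output. NAME_RE and short_name are illustrative names, not part of the answer.

```ruby
# Hypothetical names, for illustration only.
NAME_RE = /\A(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*\z/

def short_name(s)
  s.sub(NAME_RE, '\1 \2')
end

short_name("Thomas_J_Perkins")  #=> "Thomas P"
short_name("Aaron_Cole")        #=> "Aaron C"

# With ^ and $ the pattern can match a single line inside a larger string:
"x\nAaron_Cole".sub(/^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$/, '\1 \2')
#=> "x\nAaron C"

# With \A and \z the whole string must be a name, so nothing is replaced:
short_name("x\nAaron_Cole")     #=> "x\nAaron_Cole"
```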
I'm in the don't-use-a-regex-for-this camp.
str1 = "Alexander_Graham_Bell"
str2 = "Sylvester_Grisby"
"#{str1[0...str1.index('_')]} #{str1[str1.rindex('_')+1]}"
#=> "Alexander B"
"#{str2[0...str2.index('_')]} #{str2[str2.rindex('_')+1]}"
#=> "Sylvester G"
or
first, last = str1.split(/_.+_|_/)
#=> ["Alexander", "Bell"]
first+' '+last[0]
#=> "Alexander B"
first, last = str2.split(/_.+_|_/)
#=> ["Sylvester", "Grisby"]
first+' '+last[0]
#=> "Sylvester G"
but if you insist...
r = /
(.+?) # match any characters non-greedily in capture group 1
(?=_) # match an underscore in a positive lookahead
(?:.*) # match any characters greedily in a non-capture group
(?:_) # match an underscore in a non-capture group
(.) # match any character in capture group 2
/x # free-spacing regex definition mode
str1 =~ r
$1+' '+$2
#=> "Alexander B"
str2 =~ r
$1+' '+$2
#=> "Sylvester G"
You can of course write
r = /(.+?)(?=_)(?:.*)(?:_)(.)/
This is my attempt:
/([a-zA-Z]+)_([a-zA-Z]+_)?([a-zA-Z])/
See demo
Let's see if this works:
/^([^_]+)(?:_\w)?_(\w)/
And then you'll have to combine the first and second matches into the format you want. I don't know Ruby, so I can't help you there.
And another attempt using a replacement method:
result = subject.gsub(/^([^_]+)(?:_[^_])?_([^_])[^_]+$/, '\1 \2')
We capture the entire string, with the relevant parts in capturing groups. Then just return the two captured groups
Using the split method is much better:
full_names.map do |full_name|
parts = full_name.split('_').values_at(0,-1)
parts.last.slice!(1..-1)
parts.join(' ')
end
/^[A-Za-z]{5,15}\s[A-Za-z]{1}$/i
This will have the following criteria:
5-15 characters for first name then a whitespace and finally a single character for last name.

How do the anchors \z and \G work in Ruby?

I am using Ruby 1.9.3 and I am new to this platform.
From the docs I just became familiar with two anchors, \z and \G. I played with \z a little to see how it works, because its definition (End or End of String) confused me: I can't tell what "End" means there. So I tried the small snippets below, but I still don't get it.
CODE
irb(main):011:0> str = "Hit him on the head me 2\n" + "Hit him on the head wit>
=> "Hit him on the head me 2\nHit him on the head with a 24\n"
irb(main):012:0> str =~ /\d\z/
=> nil
irb(main):013:0> str = "Hit him on the head me 24 2\n" + "Hit him on the head >
=> "Hit him on the head me 24 2\nHit him on the head with a 24\n"
irb(main):014:0> str =~ /\d\z/
=> nil
irb(main):018:0> str = "Hit1 him on the head me 24 2\n" + "Hit him on the head>
=> "Hit1 him on the head me 24 2\nHit him on the head with a11 11 24\n"
irb(main):019:0> str =~ /\d\z/
=> nil
irb(main):020:0>
Every time I got nil as the output. So how is \z evaluated? What does End mean? I think I have misunderstood the word End in the docs. Could anyone help me understand why I get this output?
Also, I didn't find any example for the anchor \G. Could anyone give an example that shows how \G is used in real-world programming?
EDIT
irb(main):029:0>
irb(main):030:0* ("{123}{45}{6789}").scan(/\G(?!^)\{\d+\}/)
=> []
irb(main):031:0> ('{123}{45}{6789}').scan(/\G(?!^)\{\d+\}/)
=> []
irb(main):032:0>
Thanks
\z matches the end of the input. You are trying to find a match where 4 occurs at the end of the input. Problem is, there is a newline at the end of the input, so you don't find a match. \Z matches either the end of the input or a newline at the end of the input.
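A short sketch of that distinction follows. The constant names STRICT_END and LOOSE_END are just labels I chose for the two patterns.

```ruby
# Hypothetical names, for illustration only.
STRICT_END = /\d\z/  # digit at the very end of the string
LOOSE_END  = /\d\Z/  # digit at the end, or just before one final newline

"24"     =~ STRICT_END  #=> 1
"24\n"   =~ STRICT_END  #=> nil (the last character is the newline, not a digit)
"24\n"   =~ LOOSE_END   #=> 1
"24\n\n" =~ LOOSE_END   #=> nil (only a single trailing newline is tolerated)

# Stripping the line ending first also works with the strict anchor:
"24\n".chomp =~ STRICT_END  #=> 1
```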
So:
/\d\z/
matches the "4" in:
"24"
and:
/\d\Z/
matches the "4" in the above example and the "4" in:
"24\n"
Check out this question for example of using \G:
Examples of regex matcher \G (The end of the previous match) in Java would be nice
UPDATE: Real-World uses for \G
I came up with a more real world example. Say you have a list of words that are separated by arbitrary characters that cannot be well predicted (or there's too many possibilities to list). You'd like to match these words where each word is its own match up until a particular word, after which you don't want to match any more words. For example:
foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak
You want to match each word until 'har'. You don't want to match 'har' or any of the words that follow. You can do this relatively easily using the following pattern:
/(?<=^|\G\W)\w+\b(?<!har)/
rubular
The first attempt will match the beginning of the input followed by zero non-word character followed by 3 word characters ('foo') followed by a word boundary. Finally, a negative lookbehind assures that the word which has just been matched is not 'har'.
On the second attempt, matching picks back up at the end of the last match. 1 non-word character is matched (',' - though it is not captured due to the lookbehind, which is a zero-width assertion), followed by 3 characters ('bar').
This continues until 'har' is matched, at which point the negative lookbehind is triggered and the match fails. Because all matches are supposed to be "attached" to the last successful match, no additional words will be matched.
The result is:
foo
bar
baz
buz
fuzz
hoo
If you want to reverse it and have all words after 'har' (but, again, not including 'har'), you can use an expression like this:
/(?!^)(?<=har\W|\G\W)\w+\b/
rubular
This will match either a word which is immediately preceded by 'har' or the end of the last match (except we have to make sure not to match the beginning of the input). The list of matches is:
haz
fil
bil
bak
If you do want to match 'har' and all following words, you could use this:
/\bhar\b|(?!^)(?<=\G\W)\w+\b/
rubular
This produces the following matches:
har
haz
fil
bil
bak
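For a smaller Ruby illustration of \G, closer to the {123}{45}{6789} input from the question's edit, here is a sketch. BRACED is a name I made up; the point is only that with String#scan, \G anchors each attempt to where the previous match ended.

```ruby
# Hypothetical name, for illustration only.
BRACED = /\G\{\d+\}/

# Every group directly follows the previous match, so all of them are found.
"{123}{45}{6789}".scan(BRACED)  #=> ["{123}", "{45}", "{6789}"]

# The chain is broken at "x", so scanning stops there.
"{1}{2}x{3}".scan(BRACED)       #=> ["{1}", "{2}"]

# The pattern from the question's edit: on the first attempt \G sits at the
# start of the string, where (?!^) fails, so not even the first group matches.
"{123}{45}{6789}".scan(/\G(?!^)\{\d+\}/)  #=> []
```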
Sounds like you want to know how Regex works? Or do you want to know how Regex works with ruby?
Check these out.
Regexp Class description
The Regex Coach - Great for testing regex matching
Regex cheat sheet
I understand \G to be a boundary match character. So it would tell the next match to start at the end of the last match. Perhaps since you haven't made a match yet you can't have a second.
Here is the best example I can find. It's not in Ruby but the concept should be the same.
I take it back this might be more useful

regular expression in ruby for strings with multiple patterns

I have strings with optional substrings, and I am looking for a regular expression with named captures in Ruby: a single regular expression for all of them, if possible.
Please help.
Sample strings:
string1 = "bike wash" # a simple task
string2 = "bike wash # bike point" # a simple task with location
string3 = "bike wash # bike point on 13 may 11" # task with location and date
string4 = "bike wash # bike point on 13 may 11 # 10 AM" # task with location, date and time
string5 = "bike wash on 13 may 11 # 10 AM" # task with date and time without location
string6 = "bike wash on 13 may 11" # task and date
I have spent almost a day on Google and Stack Overflow trying to get a single regular expression for all of the above patterns.
Assumptions:
Location and time start with #, and # appears nowhere else.
Date starts with on surrounded with obligatory white spaces, and on appears nowhere else.
Task is obligatory.
Location and date are optional and independent of one another.
Time appears only when there is date.
Task, location, date, time only appear in this order.
Also, it should be taken for granted that the regex engine is oniguruma since named capture is mentioned.
# In free-spacing (x) mode an unescaped # starts a comment, so the literal # is written \#.
regex = /
(?<task>.*?)
(?:\s*\#\s*(?<location>.*?))?
(?:\s+on\s+(?<date>.*?)
(?:\s*\#\s*(?<time>.*))?
)?
\z/x
string4.match(regex)
# => #<MatchData
"bike wash # bike point on 13 may 11 # 10 AM"
task: "bike wash"
location: "bike point"
date: "13 may 11"
time: "10 AM"
>
For a regular expression to do this job, some assumptions need to be made. For example, tasks should not include " # " or " on ", and there may be more such restrictions.
To match any character but the first space for " # " or " on ", I'd use (?! # | on ).
So you could find the task using (((?! # | on ).)+). This is followed by an optional location, prefixed with " # ": (?: # ((?:(?! on ).)+))?. Note that the location should not include " on " here.
Following that, there is an optional date with an optional time: (?: on ((?:(?! # ).)+)(?: # (.+))?)?. All together:
((?:(?! # | on ).)+)(?: # ((?:(?! on ).)+))?(?: on ((?:(?! # ).)+)(?: # (.+))?)?
This will have task, location, date and time in the first four capturing groups. See here: http://regexr.com?2tnb3
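To try the named-capture approach on all six sample strings, something like the sketch below could be used. TASK_RE and parse_task are illustrative names, and the literal # is written as \# here because a bare # starts a comment in free-spacing (x) mode.

```ruby
# Hypothetical names, for illustration only.
TASK_RE = /
  (?<task>.*?)
  (?:\s*\#\s*(?<location>.*?))?    # optional location after the first literal #
  (?:\s+on\s+(?<date>.*?)          # optional date after " on "
    (?:\s*\#\s*(?<time>.*))?       # optional time, only when a date is present
  )?
\z/x

# Returns a hash of the four named groups; missing parts are nil.
def parse_task(str)
  m = TASK_RE.match(str)
  m && m.names.zip(m.captures).to_h
end

parse_task("bike wash # bike point on 13 may 11 # 10 AM")
#=> {"task"=>"bike wash", "location"=>"bike point", "date"=>"13 may 11", "time"=>"10 AM"}
parse_task("bike wash on 13 may 11 # 10 AM")
#=> {"task"=>"bike wash", "location"=>nil, "date"=>"13 may 11", "time"=>"10 AM"}
parse_task("bike wash")
#=> {"task"=>"bike wash", "location"=>nil, "date"=>nil, "time"=>nil}
```

The same assumptions as listed above apply: the delimiter and the word "on" must not appear inside the task, location or date themselves.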
