ruby multiline scan between ; and negate? - ruby

I'm trying to match text between ;-.
I used:
inputx.scan(/;-.+?\n[^\n]*;-/)
but it doesn't work.
My text is:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
I need to separate the text between ;-.
For example, this is the first element of the resulting array:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
And this is second:
fly;-1
cat;4
bird;4
dragon;6
mor;-1

You may use a regex that will match any line that ends with - and 1 or more digits, and then matches any text up to the first line that ends with - and 1 or more digits:
/.*-\d+$(?m:.*?-\d+$)/
See the Rubular demo
Details:
.*-\d+$ - any 0+ chars other than line breaks, followed with - and 1+ digits
(?m:.*?-\d+$) - a modifier group where . matches line breaks matching:
.*? - any 0+ chars, as few as possible
- - a hyphen
\d+ - 1 or more digits
$ - end of line.
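As a quick check of the pattern above, here is a minimal sketch that runs it against the sample data from the question. The name inputx comes from the question; building the string from a word list, and the assumption that the lines carry no leading indentation, are mine.

```ruby
# Sample data from the question, assumed to have one record per line
# and no leading whitespace.
inputx = %w[
  baseball;-1 norm;4 dad;3 soda;1 robot;-8 mmm;3
  fly;-1 cat;4 bird;4 dragon;6 mor;-1
].join("\n") + "\n"

# First half: a whole line ending in "-<digits>".
# Second half: (?m:...) lets "." cross line breaks, lazily, up to the
# next line ending in "-<digits>".
blocks = inputx.scan(/.*-\d+$(?m:.*?-\d+$)/)

blocks.each { |block| puts block, '--' }
```

Note that the "mmm;3" line between the two blocks is skipped, because it lies outside any pair of lines ending in "-<digits>".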

You can use Array#split twice, the first to split by lines, and the second to split based on the presence of either ; or ;- (using the pattern /;-?/)
The pattern /;-?/ matches a semicolon followed by an optional -.
inputx.split("\n").map{|s| s.split(/;-?/)}
#=> [[" baseball", "1"], [" norm", "4"], [" dad", "3"], [" soda", "1"], [" robot", "8"], [" mmm", "3"], [" fly", "1"], [" cat", "4"], [" bird", "4"], [" dragon", "6"], [" mor", "1"]]

A pattern with scan or split results in a regex that is needlessly complicated because it's not the best tool in the box for the problem.
I'd use something like this:
text = <<EOT
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
EOT
ary = [[]]
text.lines.each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1\n",
# " norm;4\n",
# " dad;3\n",
# " soda;1\n",
# " robot;-8\n"],
# [" fly;-1\n",
# " cat;4\n",
# " bird;4\n",
# " dragon;6\n",
# " mor;-1\n"]]
If you don't want trailing new-lines:
ary = [[]]
text.lines.map(&:chomp).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1", " norm;4", " dad;3", " soda;1", " robot;-8"],
# [" fly;-1", " cat;4", " bird;4", " dragon;6", " mor;-1"]]
If you don't want the whitespace surrounding each element:
ary = [[]]
text.lines.map(&:strip).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [["baseball;-1", "norm;4", "dad;3", "soda;1", "robot;-8"],
# ["fly;-1", "cat;4", "bird;4", "dragon;6", "mor;-1"]]
How does this work? The .. and ... operators change meaning depending on whether they're used in the context of a Range or in an if condition. In a condition, .. is called a "flip-flop" operator, which changes state when the first condition is met. It begins returning true at that point and keeps doing so through the line where the second condition is met, after which it returns false again. That makes it easy to look for something, then act on subsequent lines until the second condition occurs.
Normally we'd use different conditions, such as searching for "begin" and "end" in a block of lines in a file. In this case, though, the start and end conditions are the same, so we need it to not toggle off immediately, which is where ... comes in. It waits one iteration before testing the second condition, allowing the code to keep collecting lines until the "closing" ';-'.
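To make the difference between the two forms concrete, here is a small sketch that uses the same start and end condition with both operators. The sample lines are made up for illustration and are not the OP's data.

```ruby
# Hypothetical sample: every other line contains ";-".
lines = %w[a;-1 b;2 c;-3 d;4 e;-5 f;6 g;-7]

# Two dots: the end condition is tested on the same line that switched the
# flip-flop on, so it switches off again at once and only marker lines pass.
two_dot = lines.select { |l| true if l.include?(';-') .. l.include?(';-') }

# Three dots: the end condition is first tested on the following line,
# so everything from one marker through the next marker passes.
three_dot = lines.select { |l| true if l.include?(';-') ... l.include?(';-') }

p two_dot
p three_dot
```

With two dots only the lines containing ";-" survive; with three dots "d;4" is the only line dropped, since it falls between a closing marker and the next opening one.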
I have to say, this data set is one of the weirdest I've ever seen. (The weirdest was some binary data for the address book out of an old email program years ago). I'd be concerned about the process that's generating it, and if that generation was under my control I'd change it to use something more standard.

We can use Enumerable#chunk and Ruby's flip-flop operator. This does not require the use of a regular expression. str is the string given by the OP.
arr = str.lines.chunk do |line|
true if line.include?('-') ... line.include?('-')
end.select(&:first).map { |_,a| a.join }
#=> ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8\n",
# "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1\n"]
arr.each { |s| puts "\n"; puts s }
baseball;-1
norm;4
dad;3
soda;1
robot;-8
fly;-1
cat;4
bird;4
dragon;6
mor;-1
It is necessary to use three (not two) dots in the flip-flop expression (search for "three dot" in the reference given above).

Related

I could not get the intended output from the written code

I have to write a method called consonant_cancel that takes in a sentence and returns a new sentence where every word begins with its first vowel. The intended outputs for the given test calls are:
puts consonant_cancel("down the rabbit hole") #=> "own e abbit ole"
puts consonant_cancel("writing code is challenging") #=> "iting ode is allenging"
But I am getting "own e abbit it ole e" and "iting ing ode e is allenging enging ing" with this code.
def consonant_cancel(sentence)
arr = []
vowels = 'aeiou'
words = sentence.split
words.each do |word|
word.each_char.with_index do |char, i|
if vowels.include?(char)
arr << word[i..-1]
end
end
end
return arr.join(' ')
end
puts consonant_cancel("down the rabbit hole") #=> "own e abbit ole"
puts consonant_cancel("writing code is challenging") #=> "iting ode is allenging"
Can you guys help me to debug it?
You can use String#gsub with a regular expression. There is no need to break the string into pieces for processing and subsequent recombining.
def consonant_cancel(str)
str.gsub(/(?<![a-z])[a-z&&[^aeiou]]+/i,'')
end
consonant_cancel("down the rabbit hole")
#=> "own e abbit ole"
consonant_cancel("writing code is challenging")
#=> "iting ode is allenging"
See the section "Character Classes" in the doc for Regexp for an explanation of the && operator.
We can write the regular expression in free-spacing mode¹ to make it self-documenting.
/
(?<! # Begin a negative lookbehind
[a-z] # Match a lowercase letter
) # End negative lookbehind
[a-z&&[^aeiou]]+ # Match one or more lowercase letters other than vowels
/ix # Invoke case-indifference and free-spacing modes
The negative lookbehind ensures that no run of consonants immediately preceded by a letter is matched. The line
[a-z&&[^aeiou]]+
can alternatively be written
[b-df-hj-np-tv-z]+
1. See the section "Free-Spacing Mode and Comments" in the doc for Regexp.
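As a sanity check on the claim that the two character classes are interchangeable, the following sketch applies both forms to the sentences from the question. The variable names are placeholders chosen for this example.

```ruby
# Same pattern written two ways: with the && intersection operator and
# with the consonant ranges spelled out by hand.
intersection = /(?<![a-z])[a-z&&[^aeiou]]+/i
explicit     = /(?<![a-z])[b-df-hj-np-tv-z]+/i

samples = ["down the rabbit hole", "writing code is challenging"]

# For each sample, strip leading consonants using each pattern.
results = samples.map { |s| [s.gsub(intersection, ''), s.gsub(explicit, '')] }

results.each { |with_intersection, with_ranges| puts with_intersection, with_ranges }
```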
If we add a puts to see what's happening in your loop:
def consonant_cancel(sentence)
arr = []
vowels = 'aeiou'
words = sentence.split
words.each do |word|
word.each_char.with_index do |char, i|
if vowels.include?(char)
puts char
arr << word[i..-1]
end
end
end
return arr.join(' ')
end
Then running consonant_cancel "hello world" we see:
irb(main):044:0> consonant_cancel "hello world"
e
o
o
=> "ello o orld"
irb(main):045:0>
You'll see the same issue with any word that has more than one vowel, because the loop visits every character in the word and appends a new suffix each time it finds a vowel.
An easier way to accomplish this would be with regular expressions.
sentence.split.map { |w| w.sub(/^[^aeiou]*/i, "") }.join(' ')
word.each_char.with_index
This loop visits every character of the word, so it appends a suffix for every vowel it finds, not just the first. Break out of the loop after the first vowel so the side effect is not repeated for the later vowels of the same word.
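A minimal sketch of that fix, keeping the structure of the code from the question and adding only a break (assuming lowercase input, as in the question's examples):

```ruby
def consonant_cancel(sentence)
  arr = []
  vowels = 'aeiou'
  sentence.split.each do |word|
    word.each_char.with_index do |char, i|
      if vowels.include?(char)
        arr << word[i..-1]
        break # stop scanning this word once its first vowel is found
      end
    end
  end
  arr.join(' ')
end

puts consonant_cancel("down the rabbit hole")
puts consonant_cancel("writing code is challenging")
```

One consequence of this shape is that a word with no vowels at all (for example "why") contributes nothing to the result.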
As an alternative, here's another regex-based solution
def consonant_cancel(sentence)
sentence.scan(/\b[^aeiou]*(.+?)\b/i).join(" ")
end

Split a string by multiple delimiters

I want to split a string by whitespaces, commas, and dots. Given this input :
"hello this is a hello, allright this is a hello."
I want to output:
hello 3
a 2
is 2
this 2
allright 1
I tried:
puts "Enter string "
text=gets.chomp
frequencies=Hash.new(0)
delimiters = [',', ' ', "."]
words = text.split(Regexp.union(delimiters))
words.each { |word| frequencies[word] +=1}
frequencies=frequencies.sort_by {|a,b| b}
frequencies.reverse!
frequencies.each { |wor,freq| puts "#{wor} #{freq}"}
This outputs:
hello 3
a 2
is 2
this 2
allright 1
1
I do not want the last line of the output. It considers the space as a word too. This may be because there were consecutive delimiters ("," followed by " ").
Use a regex:
str = 'hello this is a hello, allright this is a hello.'
str.split(/[.,\s]+/)
# => ["hello", "this", "is", "a", "hello", "allright", "this", "is", "a", "hello"]
This allows you to split a string by any of the three delimiters you've requested.
The stop and comma are self-explanatory, and the \s refers to whitespace. The + means we match one or more of these, which avoids empty strings when two or more of these characters occur in sequence.
You might find the explanation provided by Regex101 to be handy, available here: https://regex101.com/r/r4M7KQ/3.
Edit: for bonus points, here's a nice way to get the word counts using each_with_object :)
str.split(/[.,\s]+/).each_with_object(Hash.new(0)) { |word, counter| counter[word] += 1 }
# => {"hello"=>3, "this"=>2, "is"=>2, "a"=>2, "allright"=>1}
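To get from that hash to the listing the question asks for, one option is to sort by descending count and break ties alphabetically. The tie-breaking rule is an assumption inferred from the desired output (a, is, this); the question does not state it.

```ruby
str = 'hello this is a hello, allright this is a hello.'

# Count words, splitting on runs of dots, commas, and whitespace.
counts = str.split(/[.,\s]+/).each_with_object(Hash.new(0)) do |word, counter|
  counter[word] += 1
end

# Highest count first; ties are ordered alphabetically (assumed rule).
sorted = counts.sort_by { |word, freq| [-freq, word] }

sorted.each { |word, freq| puts "#{word} #{freq}" }
```

Be aware that a string starting with a delimiter still yields an empty first element from split, so input such as ",hello" may need a reject(&:empty?) step.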

Split an array by a repetitive value

I have a variable length array of arbitrary strings. The one consistency is the string "hello" is repeated and I want to partition the array in groups by the string "hello".
So given this:
[
"hello\r\n",
"I\r\n",
"am\r\n",
"Bob\r\n",
"hello\r\n",
"How\r\n",
"are you?\r\n"
]
I want this:
[
[
"hello\r\n",
"I\r\n",
"am\r\n",
"Bob\r\n"
],
[
"hello\r\n",
"How\r\n",
"are you?\r\n"
]
]
What I have tried:
partition = []
last = input.size
index = 0
input.each_with_object([]) do |line, acc|
index += 1
if line == "hello\r\n"
acc << partition
partition = []
partition << line
else
partition << line
end
if index == last
acc << partition
end
acc
end.delete_if(&:blank?)
=> [["hello\r\n", "I\r\n", "am\r\n", "Bob\r\n"], ["hello\r\n", "How\r\n", "are you?\r\n"]]
The result is right, but is it possible to do what I want with ruby array iterators? My solution seems clunky.
You can use Enumerable#slice_before
arr.slice_before { |i| i[/hello/] }.to_a
#=> [["hello\r\n", "I\r\n", "am\r\n", "Bob\r\n"],
# ["hello\r\n", "How\r\n", "are you?\r\n"]]
or more succinctly (as pointed out by @tokland):
arr.slice_before(/hello/).to_a
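One thing to keep in mind is that /hello/ matches any string containing "hello", and that slice_before always starts the first slice at the first element. The sketch below contrasts the loose pattern with a stricter block; the sample array is made up for illustration.

```ruby
# Hypothetical input with a leading line and a word that merely contains "hello".
arr = ["Ahem\r\n", "hello\r\n", "I\r\n", "hellonfire\r\n", "am\r\n"]

# Pattern form: a new slice starts at every element matching /hello/.
loose = arr.slice_before(/hello/).to_a

# Block form: a new slice starts only when the line is exactly "hello".
strict = arr.slice_before { |s| s.chomp == 'hello' }.to_a

p loose
p strict
```

In both cases "Ahem\r\n" ends up in a slice of its own at the front, which would need to be dropped separately if it is unwanted.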
Here is a method that does not use Enumerable#slice_before, which is available only from Ruby v1.9.2 onward. It works with v1.9+ (and would work with v1.8.7+ if each_with_object were replaced with reduce/inject).
Assumptions
I have assumed:
all strings preceding the first string beginning with "hello" are discarded
to match "hello" the string must begin with "hello" followed by a non-letter, so a word that merely starts with hello (e.g., "hellonfire") does not count
Code
def group_em(arr, target)
arr.each_with_object([]) { |s,a| (s =~ /\A#{target}(?!\p{alpha})/) ?
(a << [s]) : (a.last << s unless a.empty?) }
end
Example
arr = ["Ahem\r\n", "hello\r\n", "I\r\n", "hello again\r\n", "am\r\n",
"Bob\r\n", "hellonfire\r\n", "How\r\n", "are you?\r\n"]
group_em(arr, 'hello')
#=> [["hello\r\n", "I\r\n"],
# ["hello again\r\n", "am\r\n", "Bob\r\n", "hellonfire\r\n",
# "How\r\n", "are you?\r\n"]]
Note that "Ahem\r\n" is not included because it does not follow "hello" and "hellonfire\r\n" does not trigger a new slice because it does not match "hello".
Discussion
In the example, the regular expression was computed to equal
/\Ahello(?!\p{alpha})/
It could instead be defined in free-spacing mode to make it self-documenting.
/
\A # match the beginning of the string
#{target} # match target word
(?!\p{alpha}) # do not match a letter (negative lookahead)
/x # free-spacing regex definition mode

Ruby regex to get text blocks including delimiters

When using scan in Ruby, we are searching for a block within a text file.
Sample file:
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
We want the following result in an array:
["begin\nsometext\nend","begin\nsometext2\nend"]
With this scan method:
textfile.scan(/begin\s.(.*?)end/m)
we get:
["sometext","sometext2"]
We want the begin and end still in the output, not cut off.
Any suggestions?
You may remove the capturing group completely:
textfile.scan(/begin\s.*?end/m)
See the IDEONE demo
The String#scan method returns only the captured values when the pattern contains capturing groups, so removing the group (or making it non-capturing) fixes the issue.
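The following sketch shows the three cases side by side on the sample text from the question. The patterns are slightly simplified variants written for this illustration, not the exact ones from the question.

```ruby
textfile = "sometextbefore\nbegin\nsometext\nend\nsometextafter\n" \
           "begin\nsometext2\nend\nsometextafter2\n"

# With a capturing group, scan returns one array of captures per match.
captured = textfile.scan(/begin\s(.*?)\send/m)

# With a non-capturing group, scan returns the whole match as a string.
non_capturing = textfile.scan(/begin\s(?:.*?)\send/m)

# With no group at all, the result is the same as the non-capturing form.
no_group = textfile.scan(/begin\s.*?end/m)

p captured
p non_capturing
p no_group
```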
UPDATE
If the lines inside the blocks must be trimmed from leading/trailing whitespace, you can just use a gsub against each matched block of text to remove all the horizontal whitespace (with the help of \p{Zs} Unicode category/property class):
.scan(/begin\s.*?end/m).map { |s| s.gsub(/^\p{Zs}+|\p{Zs}+$/, "") }
Here, each match is passed to a block where /^\p{Zs}+|\p{Zs}+$/ matches either the start of a line with 1+ horizontal whitespace(s) (see ^\p{Zs}+), or 1+ horizontal whitespace(s) at the end of the line (see \p{Zs}+$).
See another IDEONE demo
Here's another approach, using Ruby's flip-flop operator. I cannot say I would recommend this approach, but Rubyists should understand how the flip-flop operator works.
First let's create a file.
str =<<_
some
text
at beginning
begin
some
text
1
end
some text
between
begin
some
text
2
end
some text at end
_
#=> "some\ntext\nat beginning\nbegin\n some\n text\n 1\nend\n...at end\n"
FName = "text"
File.write(FName, str)
Now read the file line-by-line into the array lines:
lines = File.readlines(FName)
#=> ["some\n", "text\n", "at beginning\n", "begin\n", " some\n", " text\n",
# " 1\n", "end\n", "some text\n", "between\n", "begin\n", " some\n",
# " text\n", " 2\n", "end\n", "some text at end\n"]
We can obtain the desired result as follows.
lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.
map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
The two steps are as follows.
First, select and group the lines of interest, using Enumerable#chunk with the flip-flop operator.
a = lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }
#=> #<Enumerator: #<Enumerator::Generator:0x007ff62b981510>:each>
We can see the objects that will be generated by this enumerator by converting it to an array.
a.to_a
#=> [[true, ["begin\n", " some\n", " text\n", " 1\n", "end\n"]],
# [true, ["begin\n", " some\n", " text\n", " 2\n", "end\n"]]]
Note that the flip-flop operator is distinguished from a range definition by making it part of a logical expression. For that reason we cannot write
lines.chunk { |line| line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.to_a
#=> ArgumentError: bad value for range
The second step is the following:
b = a.map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
Ruby has some great methods in Enumerable. slice_before and slice_after can help with this sort of problem:
string = <<EOT
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
EOT
ary = string.split # => ["sometextbefore", "begin", "sometext", "end", "sometextafter", "begin", "sometext2", "end", "sometextafter2"]
.slice_after(/^end/) # => #<Enumerator: #<Enumerator::Generator:0x007fb1e20b42a8>:each>
.map{ |a| a.shift; a } # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"], []]
ary.pop # => []
ary # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"]]
If you want the resulting sub-arrays joined then that's an easy step:
ary.map{ |a| a.join("\n") } # => ["begin\nsometext\nend", "begin\nsometext2\nend"]

frequency of a letter in a string

When trying to find the frequency of letters in 'fantastic' I am having trouble understanding the given solution:
def letter_count(str)
counts = {}
str.each_char do |char|
next if char == " "
counts[char] = 0 unless counts.include?(char)
counts[char] += 1
end
counts
end
I tried deconstructing it and when I created the following piece of code I expected it to do the exact same thing. However it gives me a different result.
blah = {}
x = 'fantastic'
x.each_char do |char|
next if char == " "
blah[char] = 0
unless
blah.include?(char)
blah[char] += 1
end
blah
end
The first piece of code gives me the following
puts letter_count('fantastic')
>
{"f"=>1, "a"=>2, "n"=>1, "t"=>2, "s"=>1, "i"=>1, "c"=>1}
Why does the second piece of code give me
puts blah
>
{"f"=>0, "a"=>0, "n"=>0, "t"=>0, "s"=>0, "i"=>0, "c"=>0}
Can someone break down the pieces of code and tell me what the underlying difference is. I think once I understand this I'll be able to really understand the first piece of code. Additionally if you want to explain a bit about the first piece of code to help me out that'd be great as well.
You can't split this line...
counts[char] = 0 unless counts.include?(char)
... over multiple lines the way you did. A trailing (modifier) conditional has to sit on the same line as the statement it modifies.
If you wanted to split it over multiple lines you would have to convert to traditional if / end (in this case unless / end) format.
unless counts.include?(char)
counts[char] = 0
end
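To see why the split version produces only zeros, it helps to write it out the way Ruby reads it: an unconditional assignment followed by a separate unless block. The method names broken_count and fixed_count are placeholders for this sketch.

```ruby
# How Ruby reads the split version from the question.
def broken_count(str)
  counts = {}
  str.each_char do |char|
    next if char == " "
    counts[char] = 0               # runs every time, resetting the count
    unless counts.include?(char)   # never true: the key was just set
      counts[char] += 1
    end
  end
  counts
end

# The same logic with the condition guarding the initialisation instead.
def fixed_count(str)
  counts = {}
  str.each_char do |char|
    next if char == " "
    unless counts.include?(char)
      counts[char] = 0
    end
    counts[char] += 1
  end
  counts
end

p broken_count('fantastic')
p fixed_count('fantastic')
```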
Here's the explanation of the code...
# we define a method letter_count that accepts one argument str
def letter_count(str)
# we create an empty hash
counts = {}
# we loop through all the characters in the string... we will refer to each character as char
str.each_char do |char|
# we skip blank characters (we go and process the next character)
next if char == " "
# if there is no hash entry for the current character we initialise the
# count for that character to zero
counts[char] = 0 unless counts.include?(char)
# we increase the count for the current character by 1
counts[char] += 1
# we end the each_char loop
end
# we make sure the hash of counts is returned at the end of this method
counts
# end of the method
end
Now that @Steve has answered your question and you have accepted his answer, perhaps I can suggest another way to count the letters. This is just one of many approaches that could be taken.
Code
def letter_count(str)
str.downcase.each_char.with_object({}) { |c,h|
(h[c] = h.fetch(c,0) + 1) if c =~ /[a-z]/ }
end
Example
letter_count('Fantastic')
#=> {"f"=>1, "a"=>2, "n"=>1, "t"=>2, "s"=>1, "i"=>1, "c"=>1}
Explanation
Here is what's happening.
str = 'Fantastic'
We use String#downcase so that, for example, 'f' and 'F' are treated as the same character for purposes of counting. (If you don't want that, simply remove .downcase.) Let
s = str.downcase #=> "fantastic"
In
s.each_char.with_object({}) { |c,h| (h[c] = h.fetch(c,0) + 1) if c =~ /[a-z]/ }
the enumerator String#each_char is chained to Enumerator#with_object. This creates a compound enumerator:
enum = s.each_char.with_object({})
#=> #<Enumerator: #<Enumerator: "fantastic":each_char>:with_object({})>
We can view what the enumerator will pass to the block by converting it to an array:
enum.to_a
#=> [["f", {}], ["a", {}], ["n", {}], ["t", {}], ["a", {}],
# ["s", {}], ["t", {}], ["i", {}], ["c", {}]]
(Actually, it only passes an empty hash with 'f'; thereafter it passes the updated value of the hash.) The enumerator with_object creates an empty hash denoted by the block variable h.
The first element enum passes to the block is the string 'f'. The block variable c is assigned that value, so the expression in the block:
(h[c] = h.fetch(c,0) + 1) if c =~ /[a-z]/
evaluates to:
(h['f'] = h.fetch('f',0) + 1) if 'f' =~ /[a-z]/
Now
c =~ /[a-z]/
is truthy if and only if c is a lowercase letter (=~ returns the index of the match, or nil if there is none). Here
'f' =~ /[a-z]/ #=> 0
so we evaluate the expression
h[c] = h.fetch(c,0) + 1
h.fetch(c,0) returns h[c] if h has a key c; else it returns the value of Hash#fetch's second parameter, which here is zero. (fetch can also take a block.)
Since h is now empty, it becomes
h['f'] = 0 + 1 #=> 1
The enumerator each_char then passes 'a', 'n' and 't' to the block, resulting in the hash becoming
h = {'f'=>1, 'a'=>1, 'n'=>1, 't'=>1 }
The next character passed in is a second 'a'. As h already has a key 'a',
h[c] = h.fetch(c,0) + 1
evaluates to
h['a'] = h['a'] + 1 #=> 1 + 1 => 2
The remainder of the string is processed the same way.
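For comparison, here is a short sketch of the two forms of Hash#fetch mentioned above, followed by a variant that moves the default into the hash itself with Hash.new(0). The name letter_count_default is hypothetical and only used here to avoid clashing with the method above.

```ruby
h = {}

# Second argument: value returned when the key is missing.
from_argument = h.fetch('f', 0)

# Block form: the missing key is passed to the block.
from_block = h.fetch('f') { |key| key.upcase }

# Variant of the method above: a hash with a default of 0 makes the
# explicit fetch unnecessary.
def letter_count_default(str)
  str.downcase.each_char.with_object(Hash.new(0)) do |c, counts|
    counts[c] += 1 if c =~ /[a-z]/
  end
end

p from_argument
p from_block
p letter_count_default('Fantastic')
```

The trade-off is that a hash built with Hash.new(0) keeps returning 0 for unknown keys after the method returns, which may or may not be what callers expect.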
