Split a string by multiple delimiters

I want to split a string by whitespace, commas, and dots. Given this input:
"hello this is a hello, allright this is a hello."
I want to output:
hello 3
a 2
is 2
this 2
allright 1
I tried:
puts "Enter string "
text=gets.chomp
frequencies=Hash.new(0)
delimiters = [',', ' ', "."]
words = text.split(Regexp.union(delimiters))
words.each { |word| frequencies[word] +=1}
frequencies=frequencies.sort_by {|a,b| b}
frequencies.reverse!
frequencies.each { |wor,freq| puts "#{wor} #{freq}"}
This outputs:
hello 3
a 2
is 2
this 2
allright 1
1
I do not want the last line of the output. It considers the space as a
word too. This may be because there were consecutive delimiters (a comma followed by a space).

Use a regex:
str = 'hello this is a hello, allright this is a hello.'
str.split(/[.,\s]+/)
# => ["hello", "this", "is", "a", "hello", "allright", "this", "is", "a", "hello"]
This allows you to split a string by any of the three delimiters you've requested.
The full stop and comma are self-explanatory, and \s matches any whitespace character. The + matches one or more of these characters in a row, so a run of two or more delimiters is treated as a single separator and does not produce empty strings.
You might find the explanation provided by Regex101 to be handy, available here: https://regex101.com/r/r4M7KQ/3.
Edit: for bonus points, here's a nice way to get the word counts using each_with_object :)
str.split(/[.,\s]+/).each_with_object(Hash.new(0)) { |word, counter| counter[word] += 1 }
# => {"hello"=>3, "this"=>2, "is"=>2, "a"=>2, "allright"=>1}
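To get from the counts to the listing the question asks for, the hash still has to be ordered. A minimal sketch, assuming ties should be broken alphabetically (the question does not say so, but its desired output is consistent with that):

```ruby
str = 'hello this is a hello, allright this is a hello.'

# Count each word; Hash.new(0) makes missing keys start at zero.
counts = str.split(/[.,\s]+/).each_with_object(Hash.new(0)) { |word, counter| counter[word] += 1 }

# Sort by descending count, then alphabetically (the tie-break is an assumption).
sorted = counts.sort_by { |word, count| [-count, word] }

# Prints one "word count" pair per line, most frequent first.
sorted.each { |word, count| puts "#{word} #{count}" }
```

Note that split still yields an empty first element when the string starts with a delimiter; if that can happen, str.scan(/[^.,\s]+/) is one way to collect only the non-empty words.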

Related

How do I extract the part of a string whose individual words begin with letters?

I'm using Ruby 2.4. Let's say I have a string that has a number of spaces in it
str = "abc def 123ffg"
How do I capture all the consecutive words at the beginning of the string that begin with a letter? So for example, in the above, I would want to capture
"abc def"
And if I had a string like
"aa22 b cc 33d ff"
I would want to capture
"aa22 b cc"
but if my string were
"66dd eee ff"
I would want to return nothing because the first word of that string does not begin with a letter.

If you can spare the extra spaces between words, you could then split the string and iterate the resulting array with take_while, using a regex to get the desired output; something like this:
str = "abc def 123ffg"
str.split.take_while { |word| word[0] =~ /[[:alpha:]]/ }
#=> ["abc", "def"]
The output is an array, but if a string is needed, you could use join at the end:
str.split.take_while { |word| word[0] =~ /[[:alpha:]]/ }.join(" ")
#=> "abc def"
More examples:
"aa22 b cc 33d ff".split.take_while { |word| word[0] =~ /[[:alpha:]]/ }
#=> ["aa22", "b", "cc"]
"66dd eee ff".split.take_while { |word| word[0] =~ /[[:alpha:]]/ }
#=> []

The Regular Expression
There's usually more than one way to match a pattern, although some are simpler than others. A relatively simple regular expression that works with your inputs and expected outputs is as follows:
/(?:(?:\A|\s*)\p{L}\S*)+/
This matches one or more strings when all of the following conditions are met:
start-of-string, or zero or more whitespace characters
followed by a Unicode category of "letter"
followed by zero or more non-whitespace characters
The first item in the list, which is the second non-capturing group, is what allows the match to be repeated until a word starts with a non-letter.
The Proofs
regex = /(?:(?:\A|\s*)\p{L}\S*)+/
regex.match 'aa22 b cc 33d ff' #=> #<MatchData "aa22 b cc">
regex.match 'abc def 123ffg' #=> #<MatchData "abc def">
regex.match '66dd eee ff' #=> #<MatchData "dd eee ff"> (the pattern is not anchored, so the match starts at the first letter it finds)
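One caveat: the pattern above is not anchored to the start of the string, so for an input such as '66dd eee ff' the search can begin at a later letter instead of returning nothing. A minimal sketch of an anchored variant follows; the pattern and the helper name leading_words are illustrative assumptions, not part of the original answer.

```ruby
# Anchored variant: the first word must start at the very beginning of the
# string and begin with a letter; later words must each begin with a letter.
ANCHORED = /\A\p{L}\S*(?:\s+\p{L}\S*)*/

# Hypothetical helper: returns the leading run of letter-initial words,
# or an empty string when the first word does not begin with a letter.
def leading_words(str)
  str[ANCHORED].to_s
end

leading_words('abc def 123ffg')   # => "abc def"
leading_words('aa22 b cc 33d ff') # => "aa22 b cc"
leading_words('66dd eee ff')      # => ""
```

Unlike the split/take_while approach, this keeps the original spacing between the captured words.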

The sub method can be used to replace everything that needs to be removed with an empty string ''.
In this case, a first sub call removes the whole text if it starts with a digit. A second sub call then removes everything from the first word that starts with a digit onward.
Answer:
str.sub(/^\d+.*/, '').sub(/\s+\d+.*/, '')
Outputs:
str = "abc def 123ffg"
# => "abc def"
str = "aa22 b cc 33d ff"
# => "aa22 b cc"
str = "66dd eee ff"
# => ""

How to split string in ruby [duplicate]

I have a string:
"1 chocolate bar at 25"
and I want to split this string into:
[1, "chocolate bar", 25]
I don't know how to write a regex for this split. And I wanted to know whether there are any other functions to accomplish it.

You could use scan with a regex:
"1 chocolate bar at 25".scan(/^(\d+) ([\w ]+) at (\d+)$/).first
The above method doesn't work if item_name has special characters.
If you want a more robust solution, you can use split:
number1, *words, at, number2 = "1 chocolate bar at 25".split
p [number1, words.join(' '), number2]
# ["1", "chocolate bar", "25"]
number1 is the first part, number2 is the last one, at the second to last, and *words is an array with everything in-between. number2 is guaranteed to be the last word.
This method has the advantage of working even if there are numbers in the middle, " at " somewhere in the string or if prices are given as floats.
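As a quick check of those claims, here is a small sketch with a made-up input that has a number in the item name, an extra " at " in the middle, and a float price. The input string is an assumption chosen for illustration.

```ruby
line = "2 7up cans at the door at 1.50"

# First word, last word, and second-to-last word are peeled off;
# the splat collects everything in between.
number1, *words, _at, number2 = line.split

result = [number1, words.join(' '), number2]
# => ["2", "7up cans at the door", "1.50"]
```

The values are still strings; call to_i or to_f on number1 and number2 if numbers are needed.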

It is not necessary to use a regular expression.
str = "1 chocolate bar, 3 donuts and a 7up at 25"
i1 = str.index(' ')
#=> 1
i2 = str.rindex(' at ')
#=> 35
[str[0,i1].to_i, str[i1+1..i2-1], str[i2+3..-1].to_i]
#=> [1, "chocolate bar, 3 donuts and a 7up", 25]

I would do:
> s="1 chocolate bar at 25"
> s.scan(/[\d ]+|[[:alpha:] ]+/)
=> ["1 ", "chocolate bar at ", "25"]
Then to get the integers and the stripped string:
> s.scan(/[\d ]+|[[:alpha:] ]+/).map {|s| Integer(s) rescue s.strip}
=> [1, "chocolate bar at", 25]
And to remove the " at":
> s.scan(/[\d ]+|[[:alpha:] ]+/).map {|s| Integer(s) rescue s[/.*(?=\s+at\s*)/]}
=> [1, "chocolate bar", 25]

You may try calling captures on the result of match with the regex (\d+) +(\D+) +at +(\d+):
string.match(/(\d+) +(\D+) +at +(\d+)/).captures
Validating input string
If you have not already validated that the input string is in the desired format, it may be better to validate and capture the data in one step. This pattern also accepts any kind of character in the item_name field and a decimal price at the end:
string.match(/^(\d+) +(.*) +at +(\d+(?:\.\d+)?)$/).captures
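The question asks for numbers rather than strings in the result, and match returns nil when the input does not fit, so calling captures directly would raise. A minimal sketch that handles both; the helper name parse_item and the choice to return nil for invalid input are assumptions made for illustration.

```ruby
# Hypothetical helper: returns [quantity, name, price] or nil if the
# line is not of the form "<integer> <name> at <price>".
def parse_item(line)
  m = line.match(/\A(\d+)\s+(.+?)\s+at\s+(\d+(?:\.\d+)?)\z/)
  return nil unless m

  quantity, name, price = m.captures
  # Keep integer prices as Integer and decimal prices as Float.
  [quantity.to_i, name, price.include?('.') ? price.to_f : price.to_i]
end

parse_item("1 chocolate bar at 25") # => [1, "chocolate bar", 25]
parse_item("2 7up cans at 1.50")    # => [2, "7up cans", 1.5]
parse_item("chocolate bar")         # => nil
```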

You can also do something like this:
"1 chocolate bar at 25"
.split()
.reject {|string| string == "at" }
.map {|string| string.scan(/^\D+$/).empty? ? string.to_i : string }
Code Example: http://ideone.com/s8OvlC

I live in the country where prices might be float, hence the more sophisticated matcher for the price.
"1 chocolate bar at 25".
match(/\A(\d+)\s+(.*?)\s+at\s+(\d[.\d]*)\z/).
captures
#⇒ ["1", "chocolate bar", "25"]

ruby multiline scan between ; and negate?

I'm trying to match text between ;-.
I used:
inputx.scan(/;-.+?\n[^\n]*;-/)
but it doesn't work.
My text is:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
I need to separate the text between ;-.
For example, this is the first element of the resulting array:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
And this is second:
fly;-1
cat;4
bird;4
dragon;6
mor;-1

You may use a regex that will match any line that ends with - and 1 or more digits, and then matches any text up to the first line that ends with - and 1 or more digits:
/.*-\d+$(?m:.*?-\d+$)/
Details:
.*-\d+$ - any 0+ chars other than line breaks, followed with - and 1+ digits
(?m:.*?-\d+$) - a modifier group where . matches line breaks matching:
.*? - any 0+ chars, as few as possible
- - a hyphen
\d+ - 1 or more digits
$ - end of line.
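Applied with scan, the pattern can be sketched as follows. The variable name inputx comes from the question; the text is rebuilt here without leading whitespace, which is an assumption about the real data.

```ruby
# The question's sample text, one entry per line.
inputx = %w[
  baseball;-1 norm;4 dad;3 soda;1 robot;-8
  mmm;3
  fly;-1 cat;4 bird;4 dragon;6 mor;-1
].join("\n")

# Each match runs from one line ending in -<digits> to the next such line.
blocks = inputx.scan(/.*-\d+$(?m:.*?-\d+$)/)
# => ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8",
#     "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1"]
```

Because the pattern has no capturing groups, scan returns the whole matched blocks.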

You can use String#split twice, the first time to split by lines, and the second to split each line on either ; or ;- (using the pattern /;-?/)
The pattern /;-?/ matches a semicolon followed by an optional -.
inputx.split("\n").map{|s| s.split(/;-?/)}
#=> [[" baseball", "1"], [" norm", "4"], [" dad", "3"], [" soda", "1"], [" robot", "8"], [" mmm", "3"], [" fly", "1"], [" cat", "4"], [" bird", "4"], [" dragon", "6"], [" mor", "1"]]

Solving this with scan or split leads to a needlessly complicated regex, because a regex is not the best tool in the box for this problem.
I'd use something like this:
text = <<EOT
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
EOT
ary = [[]]
text.lines.each do |l|
  if l[';-'] ... l[';-']
    ary.last << l
  else
    ary << []
  end
end
ary
# => [[" baseball;-1\n",
# " norm;4\n",
# " dad;3\n",
# " soda;1\n",
# " robot;-8\n"],
# [" fly;-1\n",
# " cat;4\n",
# " bird;4\n",
# " dragon;6\n",
# " mor;-1\n"]]
If you don't want trailing new-lines:
ary = [[]]
text.lines.map(&:chomp).each do |l|
  if l[';-'] ... l[';-']
    ary.last << l
  else
    ary << []
  end
end
ary
# => [[" baseball;-1", " norm;4", " dad;3", " soda;1", " robot;-8"],
# [" fly;-1", " cat;4", " bird;4", " dragon;6", " mor;-1"]]
If you don't want the whitespace surrounding each element:
ary = [[]]
text.lines.map(&:strip).each do |l|
  if l[';-'] ... l[';-']
    ary.last << l
  else
    ary << []
  end
end
ary
# => [["baseball;-1", "norm;4", "dad;3", "soda;1", "robot;-8"],
# ["fly;-1", "cat;4", "bird;4", "dragon;6", "mor;-1"]]
How does this work? The .. and ... operators change meaning depending on whether they are used to build a Range or in a conditional such as an if. In a conditional, .. is called a "flip-flop" operator, which changes state when the first condition is met. It begins returning true at that point and continues to do so until the second condition is met, after which it returns false again. That makes it easy to look for something, then act on subsequent lines until the second condition occurs.
Normally we'd use different conditions, such as searching for "begin" and "end" in a block of lines in a file. In this case, though, we needed it not to toggle immediately, since the start and end conditions are the same, which is where ... comes in. It waits one iteration before testing the second condition, so the code keeps collecting lines until the "closing" ';-'.
I have to say, this data set is one of the weirdest I've ever seen. (The weirdest was some binary data for the address book out of an old email program years ago). I'd be concerned about the process that's generating it, and if that generation was under my control I'd change it to use something more standard.
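To make the difference between the two-dot and three-dot forms concrete, here is a small sketch on made-up data (the lines array is an assumption, not the OP's text). Both forms use the same condition for start and end.

```ruby
lines = %w[a;-1 b;2 c;-3 d;4 e;-5 f;6 g;-7]

# Two dots: the end condition is tested on the same element that
# satisfied the start condition, so every range closes immediately.
two_dots = lines.select { |l| true if l.include?(';-') .. l.include?(';-') }
# => ["a;-1", "c;-3", "e;-5", "g;-7"]

# Three dots: the end condition is first tested on the next element,
# so the range stays open until the following ';-' line.
three_dots = lines.select { |l| true if l.include?(';-') ... l.include?(';-') }
# => ["a;-1", "b;2", "c;-3", "e;-5", "f;6", "g;-7"]
```

With three dots, d;4 is skipped because it falls between the line that closed one range and the line that opens the next.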

We can use Enumerable#chunk and Ruby's flip-flop operator. This does not require the use of a regular expression. str is the string given by the OP.
arr = str.lines.chunk do |line|
true if line.include?('-') ... line.include?('-')
end.select(&:first).map { |_,a| a.join }
#=> ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8\n",
# "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1\n"]
arr.each { |s| puts "\n"; puts s }
baseball;-1
norm;4
dad;3
soda;1
robot;-8
fly;-1
cat;4
bird;4
dragon;6
mor;-1
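It is necessary to use three (not two) dots in the flip-flop expression: with two dots, the end condition is also tested on the line that satisfied the start condition, so each range would close on the very line that opened it.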

Ruby regex to get text blocks including delimiters

When using scan in Ruby, we are searching for a block within a text file.
Sample file:
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
We want the following result in an array:
["begin\nsometext\nend","begin\nsometext2\nend"]
With this scan method:
textfile.scan(/begin\s.(.*?)end/m)
we get:
["sometext","sometext2"]
We want the begin and end still in the output, not cut off.
Any suggestions?

You may remove the capturing group completely:
textfile.scan(/begin\s.*?end/m)
When the pattern contains capturing groups, the String#scan method returns only the captured values; without groups, it returns the whole matches, so dropping the group (or making it non-capturing) fixes the issue.
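The effect of the capturing group can be sketched on the sample file from the question. The third pattern, which anchors begin and end to whole lines, is an additional assumption for cases where "end" might also appear inside another word.

```ruby
textfile = "sometextbefore\nbegin\nsometext\nend\nsometextafter\n" \
           "begin\nsometext2\nend\nsometextafter2\n"

# With a capturing group, scan returns only what the group captured.
with_group = textfile.scan(/begin\s(.*?)end/m)
# => [["sometext\n"], ["sometext2\n"]]

# Without a group, scan returns each whole match, delimiters included.
without_group = textfile.scan(/begin\s.*?end/m)
# => ["begin\nsometext\nend", "begin\nsometext2\nend"]

# Assumed stricter variant: begin and end must each be a line of their own.
anchored = textfile.scan(/^begin$.*?^end$/m)
# => ["begin\nsometext\nend", "begin\nsometext2\nend"]
```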
UPDATE
If the lines inside the blocks must be trimmed from leading/trailing whitespace, you can just use a gsub against each matched block of text to remove all the horizontal whitespace (with the help of \p{Zs} Unicode category/property class):
.scan(/begin\s.*?end/m).map { |s| s.gsub(/^\p{Zs}+|\p{Zs}+$/, "") }
Here, each match is passed to a block where /^\p{Zs}+|\p{Zs}+$/ matches either the start of a line with 1+ horizontal whitespace(s) (see ^\p{Zs}+), or 1+ horizontal whitespace(s) at the end of the line (see \p{Zs}+$).

Here's another approach, using Ruby's flip-flop operator. I cannot say I would recommend this approach, but Rubyists should understand how the flip-flop operator works.
First let's create a file.
str =<<_
some
text
at beginning
begin
some
text
1
end
some text
between
begin
some
text
2
end
some text at end
_
#=> "some\ntext\nat beginning\nbegin\n some\n text\n 1\nend\n...at end\n"
FName = "text"
File.write(FName, str)
Now read the file line-by-line into the array lines:
lines = File.readlines(FName)
#=> ["some\n", "text\n", "at beginning\n", "begin\n", " some\n", " text\n",
# " 1\n", "end\n", "some text\n", "between\n", "begin\n", " some\n",
# " text\n", " 2\n", "end\n", "some text at end\n"]
We can obtain the desired result as follows.
lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.
map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
The two steps are as follows.
First, select and group the lines of interest, using Enumerable#chunk with the flip-flop operator.
a = lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }
#=> #<Enumerator: #<Enumerator::Generator:0x007ff62b981510>:each>
We can see the objects that will be generated by this enumerator by converting it to an array.
a.to_a
#=> [[true, ["begin\n", " some\n", " text\n", " 1\n", "end\n"]],
# [true, ["begin\n", " some\n", " text\n", " 2\n", "end\n"]]]
Note that the flip-flop operator is distinguished from a range definition by making it part of a logical expression. For that reason we cannot write
lines.chunk { |line| line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.to_a
#=> ArgumentError: bad value for range
The second step is the following:
b = a.map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]

Ruby has some great methods in Enumerable. slice_before and slice_after can help with this sort of problem:
string = <<EOT
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
EOT
ary = string.split # => ["sometextbefore", "begin", "sometext", "end", "sometextafter", "begin", "sometext2", "end", "sometextafter2"]
.slice_after(/^end/) # => #<Enumerator: #<Enumerator::Generator:0x007fb1e20b42a8>:each>
.map{ |a| a.shift; a } # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"], []]
ary.pop # => []
ary # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"]]
If you want the resulting sub-arrays joined then that's an easy step:
ary.map{ |a| a.join("\n") } # => ["begin\nsometext\nend", "begin\nsometext2\nend"]

Use Ruby program. Input: sentence Modify: words Output: modified sentence

I am new to Ruby. This is a programming interview question to use any language. I am trying to do it in Ruby.
Write a program to input a given sentence. Replace each word with the firstletter/#ofcharactersbetween1st&lastletter/lastletter of the word. All non-alpha (numbers, punctuation, etc.) should not be changed.
Example input: There are 12 chickens for 2 roosters.
Desired output: T3e a1e 12 c6s f1r 2 r6s.
I have the concept but need help with better approach and how to put the parts together:
s="There are 12 chickens for 2 roosters."
..
=> "There are 12 chickens for 2 roosters."
a = s.split(" ")
=> ["There", "are", "12", "chickens", "for", "2", "roosters."]
puts a.length
7
=> nil
puts a[0].length
5
=> nil
puts a[0].length-2
3
=> nil
puts a[0][0]
84
=> nil
puts a[0][0].chr
T
=> nil
puts a[0].length-2
3
=> nil
puts a[0][-1].chr
e
=> nil

Try this:
s = "There are 12 chickens for 2 roosters."
s.gsub(/([A-Za-z]+)/) { $1[0] + ($1.size - 2).to_s + $1[-1] }
It uses gsub which replaces all parts of the string matching the regular expression pattern.
The pattern in this case is /([A-Za-z]+)/ and groups occurrences of one or more characters in the ranges A-Z and a-z.
{ $1[0] + ($1.size - 2).to_s + $1[-1] } is a block executed for every occurrence. $1 is the first group matched in the pattern. The block replaces the occurrence with its first character $1[0], its length minus 2 converted to a string ($1.size - 2).to_s, and its last character $1[-1].
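One edge case worth noting: for a one-letter word such as "a", the block above produces "a-1a", and a two-letter word gets a 0 in the middle. A minimal sketch that leaves words shorter than three letters unchanged; that rule is an assumption, since the question does not say how short words should be handled, and the method name abbreviate is illustrative.

```ruby
# Hypothetical helper: rewrites only runs of three or more ASCII letters,
# leaving shorter words, digits, and punctuation untouched.
def abbreviate(sentence)
  sentence.gsub(/[A-Za-z]{3,}/) do |word|
    "#{word[0]}#{word.size - 2}#{word[-1]}"
  end
end

abbreviate("There are 12 chickens for 2 roosters.")
# => "T3e a1e 12 c6s f1r 2 r6s."
abbreviate("I am a hen.")
# => "I am a h1n."
```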
