I want to split a string into chunks, each no longer than a maximum character count (say 2000), without splitting any word.
I have tried doing as below:
text.chars.each_slice(2000).map(&:join)
but sometimes, words are split. I have tried some regex:
text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)
from this question, but I don't quite get how it works and it gives me some erratic behavior, sometimes giving chunks that only contain periods.
Any pointers will be greatly appreciated.
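To see where the period-only chunks come from, here is a small reproduction with a limit of 11 instead of 2000. The sample text, the limit, and the alternative pattern at the end are my own assumptions for illustration. \b matches between a word character and a non-word character, so a chunk may end right before a period, and the period then starts the next chunk.

```ruby
text = "Hello world. Bye."
limit = 11

# The pattern from the question: \b allows a break between "d" and ".".
broken = text.scan(/.{1,#{limit}}\b|.{1,#{limit}}/).map(&:strip)
# broken is ["Hello world", ". Bye", "."]

# A possible alternative (an assumption, not the linked answer): start each
# chunk on a non-space character and end it only before whitespace or at the
# end of the string. It assumes no single word is longer than the limit.
fixed = text.scan(/\S.{0,#{limit - 1}}(?=\s|\z)/)
# fixed is ["Hello", "world. Bye."]
```

The alternative silently drops the front of any word that is longer than the limit, so treat it as a sketch and check your input first.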
Code
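def max_groups(str, n)
  arr = []
  pos = 0
  loop do
    break arr if pos == str.size
    # \G anchors the match at pos; the greedy .{1,n} takes the longest chunk
    # that either ends with a space or is followed by a space or the end.
    m = str.match(/\G.{1,#{n}}(?:(?<=[ ])|(?=[ ]|\z))/, pos)
    return nil if m.nil? # a word longer than n characters cannot be placed
    arr << m[0]
    pos += m[0].size
  end
end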
Examples
str = "Now is the time for all good people to party"
# 12345678901234567890123456789012345678901234
# 0 1 2 3 4
max_groups(str, 5)
#=> nil
max_groups(str, 6)
#=> ["Now is", " the ", "time ", "for ", "all ", "good ", "people", " to
max_groups(str, 10)
#=> ["Now is the", " time for ", "all good ", "people to ", "party"]
max_groups(str, 14)
#=> ["Now is the ", "time for all ", "good people to", " party"]
max_groups(str, 15)
#=> ["Now is the time", " for all good ", "people to party"]
max_groups(str, 29)
#=> ["Now is the time for all good ", "people to party"]
max_groups(str, 43)
#=> ["Now is the time for all good people to ", "party"]
max_groups(str, 44)
#=> ["Now is the time for all good people to party"]
str = "How you do?"
# 123456789012345678
# 0 1
max_groups(str, 4)
#=> ["How ", " ", " ", "you ", "do?"]
You could do a Notepad style word wrap.
Just construct the regex using the maximum characters per line quantifier range {1,N}.
The example below uses 32 max per line.
https://regex101.com/r/8vAkOX/1
Update: To include linebreaks within the range, add the dot-all modifier (?s)
Otherwise, stand alone linebreaks are filtered.
(?s)(?:((?>.{1,32}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,32})(?:\r?\n)?|(?:\r?\n|$))
The chunks are in $1, and you could replace with $1\r\n to get a display
that looks wrapped.
Explained
(?s) # Span line breaks
(?:
# -- Words/Characters
( # (1 start)
(?> # Atomic Group - Match words with valid breaks
.{1,32} # 1-N characters
# Followed by one of 4 prioritized, non-linebreak whitespace
(?: # break types:
(?<= [^\S\r\n] ) # 1. - Behind a non-linebreak whitespace
[^\S\r\n]? # ( optionally accept an extra non-linebreak whitespace )
| (?= \r? \n ) # 2. - Ahead a linebreak
| $ # 3. - EOS
| [^\S\r\n] # 4. - Accept an extra non-linebreak whitespace
)
) # End atomic group
|
.{1,32} # No valid word breaks, just break on the N'th character
) # (1 end)
(?: \r? \n )? # Optional linebreak after Words/Characters
|
# -- Or, Linebreak
(?: \r? \n | $ ) # Stand alone linebreak or at EOS
)
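Since the question is about Ruby, here is a rough sketch of how the same pattern might be used there. As far as I know, Ruby's regex engine does not accept the inline (?s) modifier; its dot-all switch is the /m option, so I use that instead. The helper name wrap_lines and the width parameter are my own assumptions, not part of the answer above.

```ruby
# Hypothetical helper: wraps text at `width` characters using the pattern
# above, with Ruby's /m option standing in for (?s).
def wrap_lines(text, width)
  pattern = /(?:((?>.{1,#{width}}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,#{width}})(?:\r?\n)?|(?:\r?\n|$))/m
  # scan returns one array per match holding capture group 1; the final
  # empty match at the end of the string leaves the group nil, so drop it.
  text.scan(pattern).flatten.compact
end

lines = wrap_lines("aaa bbb ccc", 7)
# lines is ["aaa bbb ", "ccc"]
```

Note that a chunk may carry one extra trailing whitespace character beyond the width, which matches the "accept an extra non-linebreak whitespace" branches in the explanation.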
This is what worked for me (thanks to @StefanPochmann's comments):
text = "Some really long string\nwith some line breaks"
The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up.
text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
The resulting chunks of strings will lose all the line breaks (\n) from the original string. If you need to maintain the line breaks, you need to replace them all with some random placeholder (before applying the regex), for example: (br), that you can use to restore the line breaks later. Like this:
text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")
After we run the regex, we can restore the line breaks for the new chunks by replacing all occurrences of (br) with \n like this:
chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}
Looks like a long process but it worked for me.
Related
Input strings:
str1 = "$13.90 Price as Shown"
str2 = "$590.50 $490.00 Price as Selected"
str3 = "$9.90 or 5/$27.50 Price as Selected"
Output strings:
str1 = "13.90"
str2 = "490.00"
str3 = "9.90"
My code to get output:
str = str.strip.gsub(/\s\w{2}\s\d\/\W\d+.\d+/, "") # remove or 5/$27.50 from string
str = /\W\d+.\d+\s\w+/.match(str).to_s.gsub("$", "").gsub(" Price", "")
This code works fine for all 3 different types of strings. But how can I improve my code? Are there any better solutions?
Also, can anyone recommend a good regex guide or book?
A regex I suggested first is just a sum total of your regexps:
(?<=(?<!\/)\$)\d+.\d+(?=\s\w+)
See demo
Since it is next to impossible to compare numbers with a regex, I suggest:
1. Extracting all float numbers
2. Parsing them as float values
3. Getting the minimum one
Here is a working snippet:
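def getLowestNumberFromString(input)
  # Collect every amount whose $ is not preceded by a slash
  arr = input.scan(/(?<=(?<!\/)\$)\d+(?:\.\d+)?/)
  # Compare numerically but return the original string, e.g. "490.00"
  arr.min_by { |value| value.to_f }
end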
puts getLowestNumberFromString("$13.90 Price as Shown")
puts getLowestNumberFromString("$590.50 $490.00 Price as Selected")
puts getLowestNumberFromString("$9.90 or 5/$27.50 Price as Selected")
The regex breakdown:
(?<=(?<!\/)\$) - assert that there is a $ symbol not preceded with / right before...
\d+ - 1 or more digits
(?:\.\d+)? - optionally followed with a . followed by 1 or more digits
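Note that if you only need to match floats with a decimal part, remove the ? and the non-capturing group from the last subpattern: /(?<=(?<!\/)\$)\d+\.\d+/. A looser variant, /(?<=(?<!\/)\$)\d*\.?\d+/, also accepts a missing integer part (as in $.50), but because its dot is optional it matches plain integers as well.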
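To make the breakdown concrete, here is a short check of what the lookbehind keeps and skips. The constant and variable names are mine, and the inputs are two of the sample strings from the question.

```ruby
# Same pattern as in the breakdown above.
PRICE = /(?<=(?<!\/)\$)\d+(?:\.\d+)?/

with_offer = '$9.90 or 5/$27.50 Price as Selected'
two_prices = '$590.50 $490.00 Price as Selected'

# "$27.50" is skipped because its $ is preceded by a slash.
offer_matches = with_offer.scan(PRICE)
# offer_matches is ["9.90"]

# min_by(&:to_f) compares numerically but keeps the original string.
lowest = two_prices.scan(PRICE).min_by(&:to_f)
# lowest is "490.00"
```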
Supposing input can be relied upon to look like one of your three examples, how about this?
expr = /\$(\d+\.\d\d)\s+(?:or\s+\d+\/\$\d+\.\d\d\s+)?Price/
str = "$9.90 or 5/$27.50 Price as Selected"
str[expr, 1] # => "9.90"
Here it is on Rubular: http://rubular.com/r/CakoUt5Lo3
Explained:
expr = %r{
\$ # literal dollar sign
(\d+\.\d\d) # capture a price with two decimal places (assume no thousands separator)
\s+ # whitespace
(?: # non-capturing group
or\s+ # literal "or" followed by whitespace
\d+\/ # one or more digits followed by literal "/"
\$\d+\.\d\d # dollar sign and price
\s+ # whitespace
)? # preceding group is optional
Price # the literal word "Price"
}x
You might use it like this:
MATCH_PRICE_EXPR = /\$(\d+\.\d\d)\s+(?:or\s+\d+\/\$\d+\.\d\d\s+)?Price/
def match_price(input)
return unless input =~ MATCH_PRICE_EXPR
$1.to_f
end
puts match_price("$13.90 Price as Shown")
# => 13.9
puts match_price("$590.50 $490.00 Price as Selected")
# => 490.0
puts match_price("$9.90 or 5/$27.50 Price as Selected")
# => 9.9
My code works fine for all 3 types of strings. Just wondering how can
I improve that code
str = str.gsub(/ or \d\/[\$\d.]+/i, '')
str = /(\$[\d.]+) P/.match(str)
Ruby Live Demo
http://ideone.com/18XMjr
A better regex is probably: /\B\$(\d+\.\d{2})\b/
str = "$590.50 $490.00 Price as Selected"
str.scan(/\B\$(\d+\.\d{2})\b/).flatten.min_by(&:to_f)
#=> "490.00"
Assuming you simply want the smallest dollar value in each line:
r = /
\$ # match a dollar sign
\d+ # match one or more digits
\. # match a decimal point
\d{2} # match two digits
/x # extended mode
[str1, str2, str3].map { |s| s.scan(r).min_by { |s| s[1..-1].to_f } }
#=> ["$13.90", "$490.00", "$9.90"]
Actually, you don't have to use a regex. You could do it like this:
def smallest(str)
  val = str.each_char.with_index(1).
    select { |c,_| c == ?$ }.
    map { |_,i| str[i..-1].to_f }.
    min
  "$%.2f" % val
end
smallest(str1) #=> "$13.90"
smallest(str2) #=> "$490.00"
smallest(str3) #=> "$9.90"
I am trying to catch only cases b and d from the sample below, i.e. END should be the only word on the line (or at least a whole word, not part of a longer word), and END should be at the beginning of the line (not necessarily at ^; it could start in column 2), matched case-insensitively (/i).
I cannot combine all of this in one regex. Can I have more than one flag in a regex? I also need the OR alternatives in this regex.
Thanks all.
M
regexDrop = /String01|String2|\AEND/i #END\n/i
a = "the long END not begin of line"
b = "ENd" # <#><< need this one
c = "END MORE WORDs"
d =" EnD" # <#><< need this one
if a =~ regexDrop then puts "a__Match: " + a else puts 'a_' end
if b =~ regexDrop then puts "b__Match: " + b else puts 'b_' end
if c =~ regexDrop then puts "c__Match: " + c else puts 'c_' end
if d =~ regexDrop then puts "d__Match: " + d else puts 'd_' end
## \w Matches word characters.
## \A Matches beginning of string. (could be not column 1)
Note that \A is an anchor (a kind of built-in lookbehind, or "zero-width assertion") that matches the beginning of the whole string. \w is a shorthand class matching letters, digits and the underscore (word characters).
Judging by your description and sample input and expected output, I think you are just looking for END anywhere in a string as a whole word and case-insensitive.
You can match the instances with
regexDrop = /String01|String2|\bEND\b/i
Here is a demo
Output:
a__Match: the long END not begin of line
b__Match: ENd
c__Match: END MORE WORDs
d__Match: EnD
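If only b and d should match, as the question asks, one possible variant is to require that END is the only word in the string, with optional whitespace around it. This is a sketch under that assumption, not the pattern from the answer above; the constant name and the hash of samples are mine.

```ruby
# \A and \z anchor the whole string; \s* tolerates leading/trailing blanks.
# Several flags can be combined after the closing slash, e.g. /.../im.
REGEX_DROP = /String01|String2|\A\s*END\s*\z/i

samples = {
  a: "the long END not begin of line",
  b: "ENd",
  c: "END MORE WORDs",
  d: " EnD"
}

matched = samples.select { |_, s| s =~ REGEX_DROP }.keys
# matched is [:b, :d]
```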
I answered my own question. Forgot to initialize count = 0
I have a bunch of sentences in a paragraph.
a = "Hello there. this is the best class. but does not offer anything." as an example.
To figure out if the first letter is capitalized, my thought is to .split the string so that a_sentence = a.split(".")
I know I can call "hello world".capitalize!, which returns nil if the string was already capitalized.
EDIT
Now I can use an array method to go through each value and call .capitalize!
And I know I can check whether something was already capitalized with .strip.capitalize!.nil?
But I can't seem to output how many were capitalized.
EDIT
count = 0
a_sentence.each do |sentence|
  if sentence.strip.capitalize!.nil?
    count += 1
    puts "#{count} capitalized"
  end
end
It outputs:
1 capitalized
Thanks for all your help. I'll stick with the above code I can understand within the framework I only know in Ruby. :)
Try this:
b = []
a.split(".").each do |sentence|
b << sentence.strip.capitalize
end
b = b.join(". ") + "."
# => "Hello there. This is the best class. But does not offer anything."
Your post's title is misleading because from your code, it seems that you want to get the count of capitalized letters at the beginning of a sentence.
Assuming that every sentence is finishing on a period (a full stop) followed by a space, the following should work for you:
split_str = ". "
regex = /^[A-Z]/
paragraph_text.split(split_str).count do |sentence|
regex.match(sentence)
end
And if you simply want to ensure that each starting letter is capitalized, you could try the following (the last sentence keeps its own period, because split only removes the ". " between sentences):
paragraph_text.split(split_str).map(&:capitalize).join(split_str)
There's no need to split the string into sentences:
str = "It was the best of times. sound familiar? Out, damn spot! oh, my."
str.scan(/(?:^|[.!?]\s)\s*\K[A-Z]/).length
#=> 2
The regex could be written with documentation by adding x after the closing /:
r = /
(?: # start a non-capture group
^|[.!?]\s # match ^ or (|) any of ([]) ., ! or ?, then one whitespace char
) # end non-capture group
\s* # match any number of whitespace chars
\K # forget the preceding match
[A-Z] # match one capital letter
/x
a = str.scan(r)
#=> ["I", "O"]
a.length
#=> 2
Instead of Array#length, you could use its alias, size, or Array#count.
You can count how many were capitalized, like this:
a = "Hello there. this is the best class. but does not offer anything."
a_sentence = a.split(".")
a_sentence.inject(0) { |sum, s| s.strip!; s.capitalize!.nil? ? sum += 1 : sum }
# => 1
a_sentence
# => ["Hello there", "This is the best class", "But does not offer anything"]
And then put it back together, like this:
"#{a_sentence.join('. ')}."
# => "Hello there. This is the best class. But does not offer anything."
EDIT
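As @Humza suggested, you could use count (on a freshly split array, since the inject call above has already capitalized the strings in place):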
a_sentence.count { |s| s.strip!; s.capitalize!.nil? }
# => 1
Using Ruby, I want to find a regular expression that correctly identifies sentence boundaries, which I am defining as any string that ends in [.!?] except when these punctuation marks exist within quotation marks, as in
My friend said "John isn't here!" and then he left.
My current code that is falling short is:
text = para.text.scan(/[^\.!?]+[(?<!(.?!)\"|.!?] /).map(&:strip)
I've pored over the regex docs, but still can't seem to understand lookbacks/lookaheads correctly.
How about something like this?
/(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/gi
Demo: https://regex101.com/r/bJ8hM5/2
How it works:
At each position in the string, the regex checks for the following:
1. A quoted string of the form "quote", which can contain anything up to the closing quote. You can also have escaped quotes, such as "hell\"o".
2. Any letter followed by a dot, followed by another letter and a final dot. This handles your special case of U.S. and the like.
3. Any other character that isn't one of the punctuation characters .?!.
These are repeated until a punctuation character is reached.
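The /gi at the end of that pattern is regex101 notation; Ruby has no g flag, and String#scan does the "global" part. Here is a minimal sketch of how it might be used from Ruby. The constant name and sample text are mine. I also reordered the quoted-string alternative to (?:\\.|[^"\\])* so that a backslash-escaped quote is consumed as a pair; that is my change, not part of the answer above.

```ruby
# Quoted strings and "x.y." abbreviations are swallowed whole, so the
# terminators inside them do not end the sentence.
SENTENCE = /(?:"(?:\\.|[^"\\])*"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/i

text = %q(My friend said "John isn't here!" and then he left. Let's go!)

sentences = text.scan(SENTENCE).map(&:strip)
# sentences is:
#   ["My friend said \"John isn't here!\" and then he left.", "Let's go!"]
```

Text after the last terminator (for example a paragraph that ends inside a quote) is not returned by this pattern, so it may need separate handling.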
Here's a partial-regex solution that disregards sentence terminators that are contained between double-quotes.
Code
def extract_sentences(str, da_terminators)
  start_with_quote = (str[0] == '"')
  str.split(/(\".*?\")/)
    .flat_map.with_index { |b,i|
      (start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
    .slice_after(/^[#{da_terminators}]$/)
    .map { |sb| sb.join.strip }
end
Example
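str = "My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left. Let's go! Later, he said \"Aren't you coming?\""
puts extract_sentences(str, '!?.')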
# My friend said "John isn't here!", then "I'm outta' here" and then he left.
# Let's go!
# Later, he said "Aren't you coming?"
Explanation
For str above and
da_terminators = '!?.'
We will need the following later:
start_with_quote = (str[0] == '"')
#=> false
Split the string on "...". We need to make \".*?\" a capture group in order to keep it in the split. The result is an array, blocks, whose elements alternate between strings surrounded by double quotes and other strings. start_with_quote tells us which is which.
blocks = str.split(/(\".*?\")/)
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left. Let's go! Later, he said ",
# "\"Aren't you coming?\""]
Split the string elements that are not surrounded by double quotes. The split is on any of the sentence terminating characters. Again it must be in a capture group in order to keep the separator.
new_blocks = blocks.flat_map.with_index { |b,i|
(start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left",
# ".",
# " Let's go",
# "!",
# " Later, he said ",
# "\"Aren't you coming?\""]
sentence_blocks_enum = new_blocks.slice_after(/^[#{da_terminators}]$/)
# #<Enumerator:0x007f9a3b853478>
Convert this enumerator to an array to see what it will pass into its block:
sentence_blocks_enum.to_a
#=> [["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left", "."],
# [" Let's go", "!"],
# [" Later, he said ", "\"Aren't you coming?\""]]
Combine the blocks of each sentence and strip whitespace, and return the array:
sentence_blocks_enum.map { |sb| sb.join.strip }
#=> ["My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left.",
# "Let's go!",
# "Later, he said \"Aren't you coming?\""]
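One caveat, as far as I can tell: when the string itself begins with a double quote, String#split returns an empty string as the first element, so the quoted pieces still sit at the odd indices and the start_with_quote test appears to pick the wrong blocks. Below is a sketch of a variant that relies on odd indices only. The method name extract_sentences_by_index and the sample string are my own; the rest follows the steps above.

```ruby
# Hypothetical variant: quoted blocks are always at odd indices after
# splitting on a captured "...", even when the string starts with a quote.
def extract_sentences_by_index(str, terminators)
  term_class = "[#{Regexp.escape(terminators)}]"
  str.split(/(".*?")/)
     .flat_map.with_index { |block, i| i.odd? ? block : block.split(/(#{term_class})/) }
     .slice_after(/\A#{term_class}\z/)
     .map { |parts| parts.join.strip }
end

result = extract_sentences_by_index('"Stop!" he yelled. Then he left.', '!?.')
# result is ["\"Stop!\" he yelled.", "Then he left."]
```

For a string that does not start with a quote, this variant should behave the same as the method above.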
I am trying to count the characters in a text file excluding white spaces. My thought was to use scan; however, the tutorial I am reading uses gsub. There is a difference in output between the two, and I was wondering why. Here are the two code blocks; the gsub version is the one that's giving me the correct output:
total_characters_nospaces = text.gsub(/\s+/, '').length
puts "#{total_characters_nospaces} characters excluding spaces."
And the other one:
chars = 0
totes_chars_no = text.scan(/\w/){|everything| chars += 1 }
puts chars
The opposite of \s is not \w - it is \S.
\w is equivalent to [a-zA-Z0-9_]. It does not include many other characters such as punctuation.
\S is the exact opposite of \s - it includes any character that is not whitespace.
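A quick comparison on a made-up sample string (my own, not from the question) shows the gap between the two classes: the punctuation is counted by \S but not by \w.

```ruby
text = "Hi, there! 42_ok\n"

word_chars = text.scan(/\w/).length        # letters, digits, underscore only
non_space  = text.scan(/\S/).length        # everything except whitespace
stripped   = text.gsub(/\s+/, '').length   # the tutorial's gsub approach

# word_chars is 12, while non_space and stripped are both 14:
# the comma and the exclamation mark account for the difference.
```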
Now that your question has been answered, here are a couple other ways you could do it:
s = "now is the time for all good"
s.count "^\s" # => 22
s.each_char.reduce(0) { |count, c| count + (c =~ /\S/ ? 1 : 0) } # => 22