Splitting a string on variable numbers of words - ruby

The following question was posted by #ruhroe about an hour ago. I was about to post an answer when it was taken down. That's unfortunate, as I thought it was rather interesting. I'm putting it back up in case the OP sees this and also to give others an opportunity to post solutions.
The original question (which I've edited):
The problem is to split a string on some spaces in the string, based on criteria which depend in part on a number given by the user. If that number were, say, 5, each substring would contain either:
one word having 5 or more characters or
as many consecutive words (separated by spaces) as possible, provided the resulting string has at most 5 characters.
For example, if the string were:
"abcdefg fg hijkl mno pqrs tuv wx yz"
the result would be:
["abcdefg", "fg", "hijkl", "mno", "pqrs", "tuv", "wx yz"]
"abcdefg" is on a separate line because it has at least five characters.
"fg" is on a separate line because "fg" contains 5 or few characters and when combined with the following word, with a space between them, the resulting string, "fg hijkl", contains more than 5 characters.
"hijkl" is on a separate line because it satisfies both criteria.
How can I do that?

I believe this does it:
str = "abcdefg fg hijkl e mn pqrs tuv wx yz"
str.scan(/\b(?:\w{5,}|\w[\w\s]{0,3}\w|\w)\b/)
#=> ["abcdefg", "fg", "hijkl", "e mn", "pqrs", "tuv", "wx yz"]

As you iterate through the words in your collection (splitting the original string up into words should be trivial), it seems like there are three possible scenarios:
It's a blank line, and we should insert the current word into the line
It's a non-blank line, and the word can fit
It's a non-blank line, and the word can't fit and it should go into a new line
Something like this should work (note - I haven't tested this much outside of your solution. You'll definitely want to do that):
words.each do |word|
if line.blank?
# this is a new line, so start it with the current word
line << word
elsif word_can_fit_line?(line, word, length)
# the word fits, so append it to the current line
line << " #{word}"
else
# the word doesn't fit, so keep this line and start a new one with
# the current word
lines << line
line = word
end
end
# add the last line and we're done
lines << line
lines
Note that the implementation of word_can_fit_line? should be trivial - you just want to see if the current line length, plus a space, plus the word length, is less than or equal to your desired line length.

Related

Checking if a text file is formatted in a specific way

I have a text file which contains instructions. I'm reading it using File.readlines(filename). I want to check that the file is formatted as follows:
Has 3 lines
Line 1: two integers (including negatives) separated by a space
Line 2: two integers (including negatives) separated by a space and 1 capitalised letter of the alphabet also separated by a space.
Line 3: capitalised letters of the alphabet without any spaces (or punctuation).
This is what the file should look like:
8 10
1 2 E
MMLMRMMRRMML
So far I have calculated the number of lines using File.readlines(filename).length. How do I check the format of each line, do I need to loop through the file?
EDIT:
I solved the problem by creating three methods containing regular expressions, then I passed each line into it's function and created a conditional statement to check if the out put was true.
Suppose IO::read is used to return the following string str.
str = <<~END
8 10
1 2 E
MMLMRMMRRMML
END
#=> "8 10\n1 2 E\nMMLMRMMRRMML\n"
You can then test the string with a single regular expression:
r = /\A(-?\d+) \g<1>\n\g<1> \g<1> [A-Z]\n[A-Z]+\n\z/
str.match?(r)
#=> true
I could have written
r = /\A-?\d+ -?\d+\n-?\d+ -?\d+ [A-Z]\n[A-Z]+\n\z/
but matching an integer (-?\d+) is done three times. It's slightly shorter, and reduces the chance of error, to put the first of the three in capture group 1, and then treat that as a subexpression by calling it with \g<1> (not to be confused with a back-reference, which is written \k<1>). Alternatively, I could have use named capture groups:
r = /\A(?<int>-?\d+) \g<int>\n\g<int> \g<int> (?<cap>[A-Z])\n\g<cap>+\n\z/

Ruby regex to capture any string with a number

I am looking for a regular expression in Ruby to capture a sentence that has any sort of number in it.
For instance, I need to capture all of the following:
"5 different ways to do it"
"2 x 2 is certainly 4"
"there are 15 different things"
"Try to get to 10"
I only want to capture sentences with a number within, but that has nothing else before or after the number. I don't want to include things like:
"$2 billion dollars"
"The 5x effect"
It has to be just a sequence for 1 or more numbers at the beginning, middle, or end of a sentence.
Thanks.
You probably want something like:
/^.*(?<!\S)\d+(?!\S).*$/
Which will match a number and "look-around" for a non-space.
This
(s =~ /(^|\s)\d+(\s|$)/) ? s : nil
will return the string s if it contains at least one non-negative integer, that is:
the entire string,
at the beginning of the string followed by a whitespace character,
at the end the string preceded by a whitespace character, or
is both preceded and followed by a whitespace character.

Performing operations on each line of a string

I have a string named "string" that contains six lines.
I want to remove an "Z" from the end of each line (which each has) and capitalize the first character in each line (ignoring numbers and white space; e.g., "1. apple" -> "1. Apple").
I have some idea of how to do it, but have no idea how to do it in Ruby. How do I accomplish this? A loop? What would the syntax be?
Using regular expression (See String#gsub):
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
puts s.gsub(/z$/i, '').gsub(/^([^a-z]*)([a-z])/i) { $1 + $2.upcase }
# /z$/i - to match a trailing `z` at the end of lines.
# /^([^a-z]*)([a-z])/i - to match leading non-alphabets and alphabet.
# capture them as group 1 ($1), group 2 ($2)
output:
1. Apple
2. Banana
3. Cat
4. Dog
5. Elephant
6. Fruit
I would approach this by breaking your problem into smaller steps. After we've solved each of the smaller problems, you can put it all back together for a more elegant solution.
Given the initial string put forth by falsetru:
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
1. Break your string into an array of substrings, separated by the newline.
substrings = s.split(/\n/)
This uses the String class' split method and a regular expression. It searches for all occurrences of newline (backslash-n) and treats this as a delimiter, splitting the string into substrings based on this delimiter. Then it throws all of these substrings into an array, which we've named substrings.
2. Iterate through your array of substrings to do some stuff (details on what stuff later)
substrings.each do |substring|
.
# Do stuff to each substring
.
end
This is one form for how you iterate across an array in Ruby. You call the Array's each method, and you give it a block of code which it will run on each element in the array. In our example, we'll use the variable name substring within our block of code so that we can do stuff to each substring.
3. Remove the z character at the end of each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
end
Now, as we iterate through the array, the first thing we want to do is remove the z character at the end of each string. You do this with the gsub! method of String, which is a search-and-replace method. The first argument for this method is the regular expression of what you're looking for. In this case, we are looking for a z followed by the end-of-string (denoted by the dollar sign). The second argument is an empty string, because we want to replace what's been found with nothing (another way of saying - we just want to remove what's been found).
4. Find the index of the first letter in each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
end
The String class also has a method called index which will return the index of the first occurrence of a string that matches the regular expression your provide. In our case, since we want to ignore numbers and symbols and spaces, we are really just looking for the first occurrence of the very first letter in your substring. To do this, we use the regular expression /[a-zA-Z]/ - this basically says, "Find me anything in the range of small A to small Z or in big A to big Z." Now, we have an index (using our example strings, the index is 3).
5. Capitalize the letter at the index we have found
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
Based on the index value that we found, we want to replace the letter at that index with that same letter, but capitalized.
6. Put our substrings array back together as a single-string separated by newlines.
Now that we've done everything we need to do to each substring, our each iterator block ends, and we have what we need in the substrings array. To put the array back together as a single string, we use the join method of Array class.
result = substrings.join("\n")
With that, we now have a String called result, which should be what you're looking for.
Putting It All Together
Here is what the entire solution looks like, once we put together all of the steps:
substrings = s.split(/\n/)
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
result = substrings.join("\n")

Ruby: Remove whitespace chars at the beginning of a string

Edit: I solved this by using strip! to remove leading and trailing whitespaces as I show in this video. Then, I followed up by restoring the white space at the end of each string the array by iterating through and adding whitespace. This problem varies from the "dupe" as my intent is to keep the whitespace at the end. However, strip! will remove both the leading and trailing whitespace if that is your intent. (I would have made this an answer, but as this is incorrectly marked as a dupe, I could only edit my original question to include this.)
I have an array of words where I am trying to remove any whitespace that may exist at the beginning of the word instead of at the end. rstrip! just takes care of the end of a string. I want whitespaces removed from the beginning of a string.
example_array = ['peanut', ' butter', 'sammiches']
desired_output = ['peanut', 'butter', 'sammiches']
As you can see, not all elements in the array have the whitespace problem, so I can't just delete the first character as I would if all elements started with a single whitespace char.
Full code:
words = params[:word].gsub("\n", ",").delete("\r").split(",")
words.delete_if {|x| x == ""}
words.each do |e|
e.lstrip!
end
Sample text that a user may enter on the form:
Corn on the cob,
Fibonacci,
StackOverflow
Chat, Meta, About
Badges
Tags,,
Unanswered
Ask Question
String#lstrip (or String#lstrip!) is what you're after.
desired_output = example_array.map(&:lstrip)
More comments about your code:
delete_if {|x| x == ""} can be replaced with delete_if(&:empty?)
Except you want reject! because delete_if will only return a different array, rather than modify the existing one.
words.each {|e| e.lstrip!} can be replaced with words.each(&:lstrip!)
delete("\r") should be redundant if you're reading a windows-style text document on a Windows machine, or a Unix-style document on a Unix machine
split(",") can be replaced with split(", ") or split(/, */) (or /, ?/ if there should be at most one space)
So now it looks like:
words = params[:word].gsub("\n", ",").split(/, ?/)
words.reject!(&:empty?)
words.each(&:lstrip!)
I'd be able to give more advice if you had the sample text available.
Edit: Ok, here goes:
temp_array = text.split("\n").map do |line|
fields = line.split(/, */)
non_empty_fields = fields.reject(&:empty?)
end
temp_array.flatten(1)
The methods used are String#split, Enumerable#map, Enumerable#reject and Array#flatten.
Ruby also has libraries for parsing comma seperated files, but I think they're a little different between 1.8 and 1.9.
> ' string '.lstrip.chop
=> "string"
Strips both white spaces...

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
Scan groups all matches of the regexp into an array. The .{5} matches any 5 characters. If there are characters left at the end of the string, they will be matched by the .+. Join the array with your string
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking less is to accomodate the last match, where there may not be enough leftovers.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Attachments
Source code and output on ideone.com
References
regular-expressions.info
Finite Repetition, Greediness
Character classes
Grouping and Backreferences
Dot Matches (Almost) Any Character
Here is a solution that is adapted from the answer to a recent question:
class String
def in_groups_of(n, sep = ' ')
chars.each_slice(n).map(&:join).join(sep)
end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')

Resources