what is the elegant way to replace string without gsub? - ruby

I have some dynamic strings that contain the character X. X can appear in contiguous runs or scattered through the string. I want to replace those X with #.
For example, given abXXcX12XX, I want ab#c#12#. That means a run of contiguous X has to be replaced by a single #, and a lone X is also replaced by a single #.
I tried:
s = "aXX123Xc56XXX"
s.squeeze('X').gsub('X','#') # => "a#123#c56#"
Is there a more elegant or direct way to do the same operation?

I would use String#tr_s, which the documentation describes as follows:
Processes a copy of str as described under String#tr, then removes duplicate characters in regions that were affected by the translation.
s = "aXX123Xc56XXX"
s.tr_s('X','#') # => "a#123#c56#"
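A small sketch of what "regions that were affected by the translation" means in practice. The sample string is made up for illustration; it contains a doubled "a" that should survive:

```ruby
# Hypothetical sample input with a repeated character other than X.
TR_S_SAMPLE = "aaXX123Xc56XXX"

# tr_s translates X to # and collapses only the runs it translated.
TR_S_RESULT = TR_S_SAMPLE.tr_s('X', '#')

# squeeze with no argument collapses every repeated character,
# so the doubled "a" is lost as well.
SQUEEZE_ALL_RESULT = TR_S_SAMPLE.squeeze.gsub('X', '#')

puts TR_S_RESULT        # aa#123#c56#
puts SQUEEZE_ALL_RESULT # a#123#c56#
```

Passing 'X' to squeeze, as in the question, avoids the second problem, at the cost of a second pass over the string.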

Not sure why you wouldn't use gsub here?
Regexes seem to work pretty well:
"aXX123Xc56XXX".gsub(/X+/, "#")
=> "a#123#c56#"
The reason this works is that /X+/ will match one or more of the X character, so multiple X in a row will generate only one match and be replaced by one #.


While both gsub and tr_s will accomplish the task, here's the compelling reason to use tr_s:
require 'fruity'
STRING = 'aXX123Xc56XXX' * 1000
compare do
using_tr_s { STRING.tr_s('X', '#') }
using_gsub { STRING.gsub(/X+/, '#') }
end
Which, after running on my laptop, results in:
Running each test 16 times. Test will take about 1 second.
using_tr_s is faster than using_gsub by 5x ± 0.1
The regular expression engine has seen a lot of speedups, but it will still be beaten for non-anchored lookups. If there was a way to tell it to start at the start or end of the string we'd see it speed up greatly, and, for some search/replace actions, I've seen it outrun everything else. The pattern used is critical; poorly written ones cripple the engine, so be careful what you use.
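For readers without the fruity gem, a rough equivalent can be sketched with the standard library's Benchmark module. The iteration count and the method names are arbitrary choices for this sketch, and the timings will vary by machine and Ruby version, so no figures are shown:

```ruby
require 'benchmark'

# Same input shape as the benchmark above; the iteration count is arbitrary.
BENCH_INPUT = ('aXX123Xc56XXX' * 1000).freeze
ITERATIONS = 200

# Returns the elapsed wall-clock time in seconds for the tr_s approach.
def bench_tr_s
  Benchmark.realtime { ITERATIONS.times { BENCH_INPUT.tr_s('X', '#') } }
end

# Returns the elapsed wall-clock time in seconds for the gsub approach.
def bench_gsub
  Benchmark.realtime { ITERATIONS.times { BENCH_INPUT.gsub(/X+/, '#') } }
end

puts format('tr_s: %.4fs', bench_tr_s)
puts format('gsub: %.4fs', bench_gsub)
```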

Related

Count the number of sentences in a paragraph using Ruby

I have gotten to the point where I can split and count sentences with simple end of sentence punctuation like ! ? .
However, I need it to work for complex sentences such as:
"Learning Ruby is a great endeavor!!!! Well, it can be difficult at times..."
Here you can see the punctuation repeats itself.
What I have so far, that works with simple sentences:
def count_sentences
sentence_array = self.split(/[.?!]/)
return sentence_array.count
end
Thank you!
It's pretty easy to adapt your code to be a little more forgiving:
def count_sentences
self.split(/[.?!]+/).count
end
There's no need for the intermediate variable or return.
Note that blank (whitespace-only) strings will also be caught up in this, so you may want to filter those out:
test = "This is junk! There's a space at the end! "
That would return 3 with your code. Here's a fix for that:
def count_sentences
self.split(/[.?!]+/).grep(/\S/).count
end
That will select only those strings that have at least one non-space character.
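To check the behaviour without reopening String, the same logic can be sketched as a standalone method that takes the text as an argument (that signature is an assumption made for this example):

```ruby
# Standalone variant of the method above; the text is passed in explicitly.
def count_sentences(text)
  # Split on runs of end punctuation, then keep only the pieces
  # that contain at least one non-whitespace character.
  text.split(/[.?!]+/).grep(/\S/).count
end

puts count_sentences("This is junk! There's a space at the end! ")
# prints 2
puts count_sentences("Learning Ruby is a great endeavor!!!! Well, it can be difficult at times...")
# prints 2
```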
class String
def count_sentences
scan(/[.!?]+(?=\s|\z)/).size
end
end
str = "Learning Ruby is great!!!! The course cost $2.43... How much??!"
str.count_sentences
#=> 3
(?=\s|\z) is a positive lookahead, requiring the match to be immediately followed by a whitespace character or the end of the string.
String#count might be easiest.
"Who will treat me to a beer? I bet, alexnewby will!".count('.!?')
Compared to tadman's solution, no intermediate array needs to be constructed. However, it yields incorrect results if, for instance, a run of periods or exclamation marks is found in the string:
"Now thinking .... Ah, that's it! This is what we have to do!!!".count('.!?')
=> 8
The question therefore is: do you need exact results, or just approximate ones (which might be sufficient if this is used for statistical analysis of, say, large printed texts)? If you need exact results, you need to define what is a sentence and what is not. Think about the following text: how many sentences are in it?
Louise jumped out of the ground floor window.
"Stop! Don't run away!", cried Andy. "I did not
want to eat your chocolate; you have to believe
me!" - and, after thinking for a moment, he
added: "If you come back, I'll buy you a new
one! Large one! With hazelnuts!".
BTW, even tadman's solution is not exact. It would give a count of five for the following single sentence:
The IP address of Mr. Sloopsteen's dishwasher is 192.168.101.108!
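As a quick comparison, here is a sketch that runs the three approaches from this thread on that sentence. The method names are made up for this example; none of the three arrives at the intuitive answer of one:

```ruby
TRICKY = "The IP address of Mr. Sloopsteen's dishwasher is 192.168.101.108!"

# split on punctuation runs and drop blank pieces
def by_split(text)
  text.split(/[.?!]+/).grep(/\S/).count
end

# count punctuation runs followed by whitespace or end of string
def by_scan(text)
  text.scan(/[.!?]+(?=\s|\z)/).size
end

# count every punctuation character
def by_count(text)
  text.count('.!?')
end

puts by_split(TRICKY) # 5
puts by_scan(TRICKY)  # 2
puts by_count(TRICKY) # 5
```

The lookahead version gets closest because it ignores the dots inside the IP address, but it still counts the abbreviation "Mr." as a sentence end.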

Selecting key words in a string (that are included in an Array) to change their format in Ruby

I have a big string (text) and an Array of strings (key_words) as below:
text = 'So in this election, we cannot sit back and hope that everything works out for the best. We cannot afford to be tired or frustrated or cynical. No, hear me. Between now and November, we need to do what we did eight years ago and four years ago…'
key_words = ['frustrated', 'tired', 'hope']
My objective is to print each word in ‘text’ while changing the colour and case of the words that are included in key_words. I’ve been able to do that by doing:
require 'colorize'
text.split(/\b/).each do |x|
if key_words.include?(x.downcase) ; print x.colorize(:red)
else print x end
end
However, since I don’t want to include many words in key_words I want to make the selection more sensitive going beyond an exact match. Such as if, for example:
key_words = ['frustrat', 'tire', 'hope'] => the algorithm would select both 'Frustration', 'Frustrated' or 'Tiring' and 'Tired' or 'Hope' and 'Hopeful'.
I’ve tried playing with word lengths in both the string and the array as below, but it seems a very inefficient solution and I’m getting very confused with the usage of the .any? and .include? methods in this scenario.
key_words = ['frustrated', 'tired', 'hope']
key_words_abb = []
key_words.each { |x| key_words_abb << x.downcase[0][0..x.length-2]}
text.split(/\b/).each do |x|
if key_words_abb.include?(x.downcase[0][0..x.length-2]); print x.colorize(:red)
else print x
end
end
Since I can’t find a specific solution online I would appreciate your help.
It's worth noting that when doing repeated substitutions on strings, especially longer ones, you'll want your substitution method to be as efficient as possible. Spinning through an array of things to switch out is painfully expensive, especially as that list grows.
Here's a variation on your approach:
replacement = Regexp.new('\b%s\b' % [ Regexp.union(key_words) ])
replaced = text.gsub(replacement) do |s|
s.colorize(:red)
end
puts replaced
If you're using that substitution repeatedly you should persist the Regexp object into a constant. That avoids having to compile it for each string you're adjusting. If the list changes based on factors hard to predict, leave it like this and produce it dynamically.
One thing to note about using Ruby is it's often best to express your code as a series of transformations with output as a final step. Putting things like print in the middle of a loop complicates things unnecessarily. If you want to add an additional step to your loop you have to do a lot of extra work to move that print to a later stage. With the approach here you can just chain on the end and do whatever you want.
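A minimal sketch of how the same idea could be extended to the prefix matching the question asks for. The constant names and the highlight method are hypothetical, and plain ANSI escape codes stand in for the colorize gem so the example has no dependencies:

```ruby
# Hypothetical stems; a word is highlighted if it starts with one of them.
KEY_STEMS = %w[frustrat tire hope].freeze

# \b anchors the match at the start of a word, \w* consumes the rest of it,
# and the i flag makes the match case-insensitive.
KEY_STEM_PATTERN = /\b(?:#{Regexp.union(KEY_STEMS).source})\w*/i

# Wraps each matching word in ANSI red (stand-in for String#colorize).
def highlight(text)
  text.gsub(KEY_STEM_PATTERN) { |word| "\e[31m#{word}\e[0m" }
end

puts highlight("We cannot afford to be tired or Frustrated, only hopeful.")
```

Note that the stem 'tire' matches "tired" but not "tiring"; a shorter stem such as 'tir' would be needed for that, at the risk of more false positives.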

Good style for splitting lengthy expressions over lines

If the following is not the best style, what is for the equivalent expression?
if (some_really_long_expression__________ && \
some_other_really_long_expression)
The line continuation feels ugly. But I'm having a hard time finding a better alternative.
The parser doesn't need the backslashes in cases where the continuation is unambiguous. For example, using Ruby 2.0:
if true &&
true &&
true
puts true
end
#=> true
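The same rule applied to a more realistic condition, as a sketch with made-up names: ending each line with the operator tells the parser that the expression continues, so no backslash is needed.

```ruby
# Hypothetical predicate; each line ends with && so the parser keeps reading.
def eligible?(age, score)
  age >= 18 &&
    score > 50 &&
    score <= 100
end

puts eligible?(30, 75)  # true
puts eligible?(16, 75)  # false
```

Putting the && at the start of the next line instead would not work here: the parser would treat the first line as a complete expression and then fail on a line that begins with an operator.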
The following are some more-or-less random thoughts about the question of line length from someone who just plays with Ruby. Nor have I had any training as a software engineer, so consider yourself forewarned.
I find the problem of long lines is often more the number of characters than the number of operations. The former can be reduced by (drum-roll) shortening variable names and method names. The question, of course, is whether the application of a verbosity filter (aka babbling, prattling or jabbering filter) will make the code harder to comprehend. How often have you seen something fairly close to the following (without \)?
total_cuteness_rating = cats_dogs_and_pigs.map {|animal| \
cuteness_calculation(animal)}.reduce {|cuteness_accumulator, \
cuteness_per_animal| cuteness_accumulator + cuteness_per_animal}
Compare that with:
tot_cuteness = pets.map {|a| cuteness(a)}.reduce(&:+)
Firstly, I see no benefit of long names for local variables within a block (and rarely for local variables in a method). Here, isn't it perfectly obvious what a refers to in the calculation of tot_cuteness? How good a memory do you need to remember what a is when it is confined to a single line of code?
Secondly, whenever possible use the short form for enumerables followed by a block (e.g., reduce(&:+)). This allows us to comprehend what's going on in microseconds, here as soon as our eyes latch onto the +. The same goes for &:to_i, &:to_s or &:to_f. True, reduce {|tot, e| tot + e} isn't much longer, but we're forcing the reader's brain to decode two variables as well as the operator, when + is really all it needs.
Another way to shorten lines is to avoid long chains of operations. That comes at a cost, however. As far as I'm concerned, the longer the chain, the better. It reduces the need for temporary variables, reduces the number of lines of code and--possibly of greatest importance--allows us to read across a line, as most humans are accustomed, rather than down the page. The above line of code reads, "To calculate total cuteness, calculate each pet's cuteness rating, then sum those ratings". How could it be more clear?
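To make the short-form point concrete, here is a sketch of the same computation written both ways. The method names and the word list are invented for this example:

```ruby
# Short form: symbol-to-proc for map, symbol argument for reduce.
def total_length(words)
  words.map(&:length).reduce(0, :+)
end

# Long form: explicit blocks with named parameters.
def total_length_verbose(words)
  words.map { |word| word.length }.reduce(0) { |tot, e| tot + e }
end

puts total_length(%w[cat dog pig])          # 9
puts total_length_verbose(%w[cat dog pig])  # 9
```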
When chains are particularly long, they can be written over multiple lines without using the line-continuation character \:
array.each {|e| blah, blah, ..., blah
.map {|a| blah, blah, ..., blah
.reduce {|i| blah, blah, ..., blah }
}
}
That's no less clear than separate statements. I think this is frequently done in Rails.
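A common way to lay out a long chain, sketched here with a made-up method, is to start each continuation line with the dot. Ruby 1.9 and later treat a line beginning with .method as a continuation of the previous expression:

```ruby
# Hypothetical example: sum of the squares of the even numbers.
def squared_even_total(numbers)
  numbers
    .select(&:even?)
    .map { |n| n * n }
    .reduce(0, :+)
end

puts squared_even_total([1, 2, 3, 4])  # 20
```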
What about the use of abbreviations? Which of the following names is most clear?
number_of_dogs
number_dogs
nbr_dogs
n_dogs
I would argue the first three are equally clear, and the last no less clear if the writer consistently prefixes variable names with n_ when that means "number of". Same for tot_, and so on. Enough.
One approach is to encapsulate those expressions inside meaningful methods. And you might be able to break it into multiple methods that you can later reuse.
Other than that, it is hard to suggest anything with the little information you gave. You might be able to get rid of the if statement using command objects or something like that, but I can't tell if it makes sense in your code because you didn't show it.
Ismael's answer works really well in Ruby (and possibly in other languages too) for 2 reasons:
Ruby has very low overhead for creating methods due to the lack of type definitions
It allows you to decouple such logic for reuse or future adaptability and testing
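Since the original condition was never shown, the following is only a sketch of the extract-method idea with entirely hypothetical names (Order, billable?, ready_to_invoice?):

```ruby
# Hypothetical domain object; the fields are assumptions for illustration.
Order = Struct.new(:total, :items, :shipped) do
  # Each predicate names one piece of the long condition.
  def billable?
    total.positive? && !items.empty?
  end

  # Predicates compose, so the call site stays short.
  def ready_to_invoice?
    billable? && shipped
  end
end

order = Order.new(25, ['widget'], true)
puts 'invoice it' if order.ready_to_invoice?
```

Each small predicate can then be tested on its own and reused in other conditions.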
Another option I'll toss out is create logic equations and store the result in a variable e.g.
# these are short logic expressions testing x, but you can apply the same to longer expressions
number_gt_5 = x > 5
number_lt_20 = x < 20
number_eq_11 = x == 11
if (number_gt_5 && number_lt_20 && !number_eq_11)
# do some stuff
end

Perform operations within string format

I am sure that this has been asked, but I can't find it through my rudimentary searches.
Is it discouraged to perform operations within string initializations?
> increment = 4
=> 4
> "Incremented from #{increment} to #{increment += 1}"
=> "Incremented from 4 to 5"
I sure wouldn't, because that's not where you look for things-that-change-things when reading code.
It obfuscates intent, it obscures meaning.
Compare:
url = "#{BASE_URL}/#{request_sequence += 1}"
with:
request_sequence += 1
url = "#{BASE_URL}/#{request_sequence}"
If you're looking to see where the sequence number is coming from, which is more obvious?
I can almost live with the first version, but I'd be likely to opt for the latter. I might also do this instead:
url = build_url(request_sequence += 1)
In your particular case, it might be okay, but the problem is that the position where the operation on the variable should happen must be the last instance of the same variable in the string, and you cannot always be sure about that. For example, suppose (for some stylistic reason), you want to write
"Incremented to #{...} from #{...}"
Then, all of a sudden, you cannot do what you did. So doing operation during interpolation is highly dependent on the particular phrasing in the string, and that decreases maintainability of the code.
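A short sketch of that pitfall. Interpolations are evaluated left to right, so moving the increment earlier in the string changes what the later interpolation sees. The method names are made up for this example:

```ruby
# Increment happens in the last interpolation: the first one sees the old value.
def from_then_to
  increment = 4
  "Incremented from #{increment} to #{increment += 1}"
end

# Increment happens first: the second interpolation sees the new value.
def to_then_from
  increment = 4
  "Incremented to #{increment += 1} from #{increment}"
end

puts from_then_to  # Incremented from 4 to 5
puts to_then_from  # Incremented to 5 from 5
```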

Unused regex captures in Ruby

I have a script that processes the contents of a file from a CAD program, for use in another CAD program. Can the unused variables in the block be skipped, or written around? The script works fine with them in place, I was just curious if there was a cleaner way to write it. Thank you.
string = IO.read("file.txt")
string.scan(/regex/m) {|a,b,c,d,e,f,g|
# captures 7 items, I use 1-4, & 6 below, skipping 5 & 7
print a, b+".ext", c.to_f/25400000, d.to_f/25400000, f,"\n"
}
My question lies in the last line - if I'm not using them all - do I still have to declare them all, for it to work properly, and remain in the correct order?
Elements 5 & 7 may be used at a later time, but for now, they are just part of the regex, for future flexibility.
Since you are getting the variables as block variables, you cannot skip the order. The problem is with your regex. If you have a group that you don't want to capture, you should use the non-capturing group (?: ) instead of the capturing group ( ). So change the fifth and the seventh ( ) in your regex to (?: ). If you are using Ruby 1.9 or are using the Oniguruma regex engine on Ruby 1.8.7, then you can also use named captures; for example, use (?<foo> ) in the regex, and refer to the captured string in the block as $~[:foo].
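A minimal sketch of the difference, using a made-up pattern and input since the original regex was not shown. The non-capturing group still has to match, but it no longer produces a block variable:

```ruby
MEASUREMENTS = "width=10mm height=20in"

# Three capturing groups: the unit is captured even if it is never used.
WITH_UNIT = /(\w+)=(\d+)(mm|in)/

# The unit group is non-capturing, so only two values are yielded.
WITHOUT_UNIT = /(\w+)=(\d+)(?:mm|in)/

p MEASUREMENTS.scan(WITH_UNIT)
# [["width", "10", "mm"], ["height", "20", "in"]]
p MEASUREMENTS.scan(WITHOUT_UNIT)
# [["width", "10"], ["height", "20"]]
```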
You could use an array instead of an explicit list of variables and then pick things out of the array by index:
string.scan(/regex/m) { |a|
print a[0], a[1] + ".ext", a[2].to_f / 25400000, a[3].to_f / 25400000, a[5], "\n"
}
Either that or rework your regular expression to only capture what you need.
You can use the same variable multiple times in the list so just renaming the things you're not using to unused would probably be the simplest choice:
string.scan(/regex/m) { |a, b, c, d, unused, f, unused|
print a, b + ".ext", c.to_f / 25400000, d.to_f / 25400000, f, "\n"
}
At least this way it is (or should be) obvious that you're not using the fifth and seventh captures. However, this doesn't work in 1.9 so you'd have to use unused1 and unused2 in 1.9.
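In current Ruby versions the underscore is treated specially and may appear more than once in a parameter list, which gives a conventional way to mark ignored captures. A sketch with a made-up seven-group pattern and a hypothetical helper name:

```ruby
# Hypothetical helper: keeps captures 1-4 and 6, ignores 5 and 7.
def pick_fields(line)
  rows = []
  line.scan(/(\w)-(\w)-(\w)-(\w)-(\w)-(\w)-(\w)/) do |a, b, c, d, _, f, _|
    rows << [a, b, c, d, f]
  end
  rows
end

p pick_fields("1-2-3-4-5-6-7")
# [["1", "2", "3", "4", "6"]]
```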
An ideal balance would be to use 1.9's named capture groups, but scan passes captures to the block by position, not by name.
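One possible workaround, sketched with a made-up pattern and helper name: inside the scan block, Regexp.last_match (the same object as $~) refers to the current match, so named groups can be looked up there.

```ruby
# Hypothetical pattern with two named groups.
PAIR = /(?<key>\w+)=(?<value>\d+)/

# Builds a hash from key=value pairs, reading the groups by name.
def pairs(text)
  result = {}
  text.scan(PAIR) do
    match = Regexp.last_match
    result[match[:key]] = match[:value].to_i
  end
  result
end

p pairs("w=10 h=20")
# {"w"=>10, "h"=>20}
```

Groups that are not needed yet can stay in the pattern under a descriptive name and simply go unread.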

Resources