I know I can easily remove a substring from a string.
Now I need to remove every substring from a string, if the substring is in an array.
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
# some awesome function
string = string.replace_if_in?(arr, '')
# desired output => "Only delete the and the"
The methods for removing or replacing part of a string, such as sub, gsub, and tr, take a single pattern as an argument, not an array. My array has over 20 elements, so I need a better way than calling sub 20 times.
It's also not just about removing single words: each entry is a whole substring, such as 1. foo.
How would I attempt this?
You can use gsub which accepts a regex, and combine it with Regexp.union:
string.gsub(Regexp.union(arr), '')
# => "Only delete the  and the "
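As a minimal sketch of why Regexp.union suits this job: it escapes regexp metacharacters in string arguments, so the dot in "1. foo" only matches a literal dot. The squeeze and strip calls are an assumption about how you might tidy the leftover spaces; they are not part of the answer above.

```ruby
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"

# Regexp.union escapes metacharacters in its string arguments,
# so "." here matches only a literal dot.
pattern = Regexp.union(arr)

# Remove every listed substring, then collapse the doubled spaces
# and trim the trailing one (an optional clean-up step).
cleaned = string.gsub(pattern, '').squeeze(' ').strip

# "1x foo" is left alone because the dot is not a wildcard.
literal_only = "1x foo".gsub(pattern, '')

puts cleaned
puts literal_only
```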
Like follows:
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
arr.each {|x| string.slice!(x) }
string # => "Only delete the  and the "
As an extension, this also lets you remove text that contains regexp special characters such as \ or . (Uri's answer handles these too):
string = "Only delete the 1. foo and the 2. bar and \\...."
arr = ["1. foo", "2. bar", "\\..."]
arr.each {|x| string.slice!(x) }
string # => "Only delete the  and the  and ."
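One thing to keep in mind with this approach: slice! with a string argument removes only the first occurrence. The sketch below, using a made-up sample string, contrasts it with gsub, which removes every occurrence.

```ruby
text = "foo bar foo"

# slice! mutates its receiver and removes only the first match.
first_only = text.dup
first_only.slice!("foo")

# gsub with a String pattern treats it literally and removes all matches.
all_removed = text.gsub("foo", '')

p first_only
p all_removed
```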
Use #gsub with #join on the array elements
You can use #gsub by calling #join on the elements of the array, joining them with the regex alternation operator. For example:
arr = ["foo", "bar"]
string = "Only delete the foo and the bar"
string.gsub /#{arr.join ?|}/, ''
#=> "Only delete the  and the "
You can then deal with the extra spaces left behind in any way you see fit. This is a better method when you want to censor words. For example:
string.gsub /#{arr.join ?|}/, '<bleep>'
#=> "Only delete the <bleep> and the <bleep>"
On the other hand, split/reject/join might be a better method chain if you need to care about whitespace. There's always more than one way to do something, and your mileage may vary.
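The split/reject/join chain mentioned above might look like the following sketch. It assumes every entry in the array is a single whitespace-delimited word, so it would not handle multi-word entries such as "1. foo".

```ruby
arr = ["foo", "bar"]
string = "Only delete the foo and the bar"

# Split on whitespace, drop the unwanted words, and rejoin with
# single spaces, so no doubled or trailing spaces are left behind.
kept = string.split.reject { |word| arr.include?(word) }.join(' ')

puts kept
```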
Related
I have a big text file. Within this text file, I want to replace all mentions of the word 'pizza' with 'spinach', 'Pizza' with 'Spinach', and 'pizzing' with 'spinning' -- unless those words occur anywhere within curly braces. So {pizza}, {giant.pizza} and {hot-pizza-oven} should remain unchanged.
My best proposed solution so far is to iterate over the file line-by-line, issuing a regex that detects everything before an { or after an }, and using regexes on each of those strings. But that gets really complex and unwieldy and I want to know if there's a proper solution for this problem.
This can be done in a few steps. I'd iterate through the file line by line, and pass each line to this method:
def spinachize line
# list of words to swap
swaps = {
'pizza' => 'spinach',
'Pizza' => 'Spinach',
'pizzing' => 'spinning'
}
# random placeholder for bracketed text
placeholder = 'fdjfafdlskdsfajkldfas'
# save all instances of bracketed text
bracketed_text = line.scan(/\{.*?\}/)
# remove bracketed text from line
line.gsub!(/\{.*?\}/, placeholder)
# replace all swaps
swaps.each do |original_text, new_text|
line.gsub!(original_text, new_text)
end
# re-insert bracketed text
line.gsub(placeholder){bracketed_text.shift}
end
The comments above explain things as we go. Here are a couple of examples:
spinachize "Pizza is good, but more pizza is better"
=> "Spinach is good, but more spinach is better"
spinachize "Leave bracketed instances of {pizza} or {this.pizza} alone"
=> "Leave bracketed instances of {pizza} or {this.pizza} alone"
As you can see, you can specify the items you want swapped, or modify the method to pull the list from a database or flat file somewhere. The placeholder just needs to be something unique that wouldn't come up in the source file naturally.
The process is this: remove bracketed text from the original line, and remember it for later. Swap all text that needs swapping, then add back the bracketed text. It's not a one-liner, but it works well and is readable and easy to update.
The last line of the method might need some clarification. Not many people know that the "gsub" method can take a block instead of a second parameter. That block then determines what gets put in place of the original text. In this case, every time the block is called I remove the first item off our saved bracket list, and use that.
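To see the block form of gsub in isolation, here is a small sketch. The placeholder "XX" and the sample strings are made up for illustration.

```ruby
saved = ["{pizza}", "{this.pizza}"]
template = "keep XX or XX alone"

# The block runs once per match; its return value replaces that match.
# shift removes and returns the first saved item each time.
restored = template.gsub("XX") { saved.shift }

puts restored
```

After the call, saved is empty, because each match consumed one item.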
rules = {'pizza' => 'spinach','Pizza' => 'Spinach','pizzing' => 'spinning'}
regexp = /\{[^{}]*\}|#{rules.keys.join('|')}/m
puts(file.read.gsub(regexp) { |s| rules[s] || s })
This constructs a regular expression that matches either bracketed strings or the strings to replace. We then run it through a block that replaces strings with the given value and leaves bracketed strings unchanged. The bracketed part tolerates newlines inside the brackets, because the negated class [^{}] matches newline characters too; the /m flag only changes what . matches, so you can take it out. Either way, no need to iterate line by line.
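Here is the same idea run against an in-memory string instead of a file, as a sketch; the sample text is made up, and it includes a bracketed section that spans two lines.

```ruby
rules = { 'pizza' => 'spinach', 'Pizza' => 'Spinach', 'pizzing' => 'spinning' }
regexp = /\{[^{}]*\}|#{rules.keys.join('|')}/m

# Sample input (an assumption for this sketch), with a brace group
# that contains a newline.
text = "Pizza {hot-pizza-oven} is pizzing.\nMore {giant.\npizza} and pizza."

# Bracketed matches are not keys in rules, so `rules[s] || s`
# returns them unchanged.
result = text.gsub(regexp) { |s| rules[s] || s }

puts result
```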
str = "Pizza {pizza} with spinach is not pizzing."
swaps = {'{pizza}' =>'{pizza}',
'{Pizza}' =>'{Pizza}',
'{pizzing}'=> '{pizzing}',
'pizza' => 'spinach',
'Pizza' => 'Spinach',
'pizzing' => 'spinning'}
regex = Regexp.union(swaps.keys)
p str.gsub(regex, swaps) # => "Spinach {pizza} with spinach is not spinning."
I would call the following method for each line of the file.
Code
def doit(line)
replace = {'pizza'=>'spinach', 'Pizza'=>'Spinach', 'pizzing'=>'spinning'}
r = /\{.*?\}/
arr = line.split(r, -1).map { |str|
str.gsub(/\b(?:pizza|Pizza|pizzing)\b/, replace) }
line.scan(r).each_with_object(arr.shift) { |str,res|
res << str << arr.shift }
end
Examples
doit("Pizza Primastrada's {pizza} is the best {pizzing} pizza in town.")
#=> "Spinach Primastrada's {pizza} is the best {pizzing} spinach in town."
doit("{Pizza Primastrada}'s pizza is the best pizzing {pizza} in town.")
#=> "{Pizza Primastrada}'s spinach is the best spinning {pizza} in town."
Explanation
line = "Pizza Primastrada's {pizza} is the best {pizzing} pizza in town."
replace = {'pizza'=>'spinach', 'Pizza'=>'Spinach', 'pizzing'=>'spinning'}
r = /\{.*?\}/
a = line.split(r)
#=> ["Pizza Primastrada's ", " is the best ", " pizza in town."]
b = a.map { |str| str.gsub(/\b(?:pizza|Pizza|pizzing)\b/, replace) }
#=> ["Spinach Primastrada's ", " is the best ", " spinach in town."]
keepers = line.scan(r)
#=> ["{pizza}", "{pizzing}"]
keepers.each_with_object(b.shift) { |str,res| res << str << b.shift }
#=> "Spinach Primastrada's {pizza} is the best {pizzing} spinach in town."
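One edge case worth checking in this approach is a line that ends with a braced section. By default, String#split drops trailing empty strings, which would leave one piece too few to interleave with the scanned braces; passing a negative limit keeps them. A small sketch with a made-up line:

```ruby
r = /\{.*?\}/
line = "best pizza {pizza}"

# Default split drops the trailing empty string after the last brace group.
default_split = line.split(r)

# A negative limit keeps trailing empty strings, so there is always
# exactly one more piece than there are brace groups.
full_split = line.split(r, -1)

p default_split
p full_split
```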
Nested braces
If you wish to permit nested braces, change the regex to:
r = /\{[^{}]*?(?:\{.*?\})*?[^{}]*?\}/
doit("Pizza Primastrada's {{great {great} pizza} is the best pizza.")
#=> "Spinach Primastrada's {{great {great} pizza} is the best spinach."
You referred to the string
{words,salad,#{1,2,3} pizza|}
in a comment. If that is part of a string enclosed in single quotes, not a problem. If enclosed in double quotes, however, # will raise a syntax error. Again, no problem, if the pound character is escaped (\#).
I would like to pass a sequence of characters into a function as a string and have it return to me that string split at the following characters:
# # $ % ^ & *
such that if the string is
'hey#man^you*are#awesome'
the program returns
'hey man you are awesome'
How can I do this?
To split the string you can use String#split:
'hey#man^you*are#awesome'.split(/[##$%^&*]/)
#=> ["hey", "man", "you", "are", "awesome"]
To bring it back together, you can use Array#join:
'hey#man^you*are#awesome'.split(/[##$%^&*]/).join(' ')
#=> "hey man you are awesome"
split and join should be self-explanatory. The interesting part is the regular expression /[##$%^&*]/ which matches any of the characters inside the character class [...]. The above code is essentially equivalent to
'hey#man^you*are#awesome'.gsub(/[##$%^&*]/, ' ')
#=> "hey man you are awesome"
where the gsub means "globally substitute any occurrence of one of the characters ##$%^&* with a space".
You could also use String#tr, which avoids the need to convert an array back to a string:
'hey#man^you*are#awesome'.tr('##$%^&*', ' ')
#=> "hey man you are awesome"
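Since the question asks for a function, the tr approach could be wrapped as in the sketch below. The method name replace_symbols and its default symbol list are hypothetical, and the squeeze call is an added assumption so that adjacent delimiters yield a single space (it also collapses runs of spaces already present in the input).

```ruby
# Hypothetical helper: replace each listed symbol with a space,
# then collapse runs of spaces into one.
def replace_symbols(text, symbols = '#$%^&*')
  # In tr, "^" only negates when it is the first character, so here
  # it is treated as a literal caret.
  text.tr(symbols, ' ').squeeze(' ')
end

basic = replace_symbols('hey#man^you*are#awesome')
adjacent = replace_symbols('a#$b')

puts basic
puts adjacent
```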
If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.
You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you wanted also punctuation, you can also build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡MarcAndré!"
For all character properties, you can refer to the doc.
string.gsub(/[^[:alnum:]]/, "")
The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.
You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using ruby since you tagged that in your post. You could go through the array, put it through a test using a regexp, and if it passes remove/keep it based on the regexp you use.
A regexp you might use might go something like this:
[^.!,^#-]
That will tell you if it's not one of the characters inside the brackets (the hyphen goes last so that it is not read as a range). However, I suggest that you look up regular expressions; you might find a better solution once you know their syntax and usage.
If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string in the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.
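To illustrate why \A and \z matter here rather than ^ and $, consider a multi-line string (made up for this sketch). In Ruby, ^ and $ anchor to line boundaries, so a single clean line is enough for a match.

```ruby
multi = "hello\n!!!"

# ^ and $ match at the start and end of any line, so the first
# line "hello" satisfies the pattern on its own.
line_anchored = multi.match?(/^\p{Alnum}+$/)

# \A and \z match only at the very start and end of the string.
string_anchored = multi.match?(/\A\p{Alnum}+\z/)

p line_anchored
p string_anchored
```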
Currently I am splitting a string by a pattern, like this:
outcome_array=the_text.split(pattern_to_split_by)
The problem is that the pattern I split by always gets omitted.
How do I get the result to include the split pattern itself?
Thanks to Mark Wilkins for inspiration, but here's a shorter bit of code for doing it:
irb(main):015:0> s = "split on the word on okay?"
=> "split on the word on okay?"
irb(main):016:0> b=[]; s.split(/(on)/).each_slice(2) { |s| b << s.join }; b
=> ["split on", " the word on", " okay?"]
or:
s.split(/(on)/).each_slice(2).map(&:join)
See below the fold for an explanation.
Here's how this works. First, we split on "on", but wrap it in parentheses to make it into a match group. When there's a match group in the regular expression passed to split, Ruby will include that group in the output:
s.split(/(on)/)
# => ["split ", "on", " the word ", "on", " okay?"]
Now we want to join each instance of "on" with the preceding string. each_slice(2) helps by passing two elements at a time to its block. Let's just invoke each_slice(2) to see what results. Since each_slice, when invoked without a block, will return an enumerator, we'll apply to_a to the Enumerator so we can see what the Enumerator will enumerate over:
s.split(/(on)/).each_slice(2).to_a
# => [["split ", "on"], [" the word ", "on"], [" okay?"]]
We're getting close. Now all we have to do is join the words together. And that gets us to the full solution above. I'll unwrap it into individual lines to make it easier to follow:
b = []
s.split(/(on)/).each_slice(2) do |s|
b << s.join
end
b
# => ["split on", " the word on", " okay?"]
But there's a nifty way to eliminate the temporary b and shorten the code considerably:
s.split(/(on)/).each_slice(2).map do |a|
a.join
end
map passes each element of its input array to the block; the result of the block becomes the new element at that position in the output array. In MRI >= 1.8.7, you can shorten it even more, to the equivalent:
s.split(/(on)/).each_slice(2).map(&:join)
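The same chain could be wrapped in a small reusable method, sketched below. The name split_keeping_delimiter is hypothetical, and Regexp.escape is an addition so that delimiters containing regexp metacharacters (such as ".") are treated literally.

```ruby
# Hypothetical helper: split text on a literal delimiter and keep
# the delimiter attached to the end of the preceding piece.
def split_keeping_delimiter(text, delimiter)
  # The capture group makes split include the delimiter in its output.
  pattern = /(#{Regexp.escape(delimiter)})/
  text.split(pattern).each_slice(2).map(&:join)
end

words = split_keeping_delimiter("split on the word on okay?", "on")
dotted = split_keeping_delimiter("a.b.c", ".")

p words
p dotted
```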
You could use a regular expression assertion to locate the split point without consuming any of the input. Below uses a positive look-behind assertion to split just after 'on':
s = "split on the word on okay?"
s.split(/(?<=on)/)
=> ["split on", " the word on", " okay?"]
Or a positive look-ahead to split just before 'on':
s = "split on the word on okay?"
s.split(/(?=on)/)
=> ["split ", "on the word ", "on okay?"]
With something like this, you might want to make sure 'on' was not part of a larger word (like 'assertion'), and also remove whitespace at the split:
"don't split on assertion".split(/(?<=\bon\b)\s*/)
=> ["don't split on", "assertion"]
If you use a pattern with groups, it will return the pattern in the results as well:
irb(main):007:0> "split it here and here okay".split(/ (here) /)
=> ["split it", "here", "and", "here", "okay"]
Edit: The additional information indicated that the goal is to include the item on which it was split with one of the halves of the split items. I would think there is a simple way to do that, but I don't know it and haven't had time today to play with it. So in the absence of the clever solution, the following is one way to brute force it. Use the split method as described above to include the split items in the array. Then iterate through the array and combine every second entry (which by definition is the split value) with the previous entry.
s = "split on the word on and include on with previous"
a = s.split(/(on)/)
# iterate through and combine adjacent items together and store
# results in a second array
b = []
a.each_index{ |i|
b << a[i] if i.even?
b[b.length - 1] += a[i] if i.odd?
}
print b
Results in this:
["split on", " the word on", " and include on", " with previous"]
I want to do a sequence of gsubs against one string, so I utilized the fact that gsub can take a hash as the second argument. One thing I wanted to do with gsub is to convert a sequence of one or more space/tab into a single space, so I have something essentially as follows:
gsub(/[ \t]+/, {/[ \t]+/ => ' '})
In my actual code, the first argument is a union of the regexp I gave here, and the second argument includes more key-value pairs.
Now, when I apply this to a string, all of the spaces/tabs are deleted. I suppose this is because the match to the first argument is not regarded as matching the key /[ \t]+/ in the second argument (the hash). Does the lookup in the hash only look for an exact string match, not a regexp match? If so, is there any way to get around it?
This is a related question. If you need to use the hash because many things have to be substituted, this might work:
list = Hash.new{|h,k|if /\s+/ =~ k then ' ' else k end}
list['foo'] = 'bar'
list['apple'] = 'banana'
p "appleabc\t \tabc apple foo".gsub(/\w+|\W+/,list)
#=> "appleabc abc banana bar"
p list
#=>{"foo"=>"bar", "apple"=>"banana"} no garbage
According to the docs, gsub with a hash as the second parameter only matches against literal strings:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
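A short sketch of what that means in practice: the matched text is looked up in the hash as a plain string key, and a match with no corresponding key is replaced by an empty string. That is why a Regexp key never applies and the whitespace disappears. The sample strings are made up.

```ruby
# "l" matches the pattern but has no key in the hash,
# so each "l" is replaced with an empty string.
partial = 'hello'.gsub(/[el]/, { 'e' => '3' })

# The matched text " \t " is a String; the only key is a Regexp,
# so the lookup finds nothing and the whitespace is deleted.
collapsed = "a \t b".gsub(/[ \t]+/, { /[ \t]+/ => ' ' })

p partial
p collapsed
```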
If you want to mix literal strings and regular expressions as search patterns, you could work around it by creating a hash whose key/value pairs are the search => replacement pairs, iterating over the hash, and passing each pair to gsub. Because Ruby 1.9+ maintains the insertion order of the hash, the replacements are guaranteed to run in the order you want.
search_hash = {
'1' => 'one',
'too' => 'two',
/[\t ]+/ => ' '
}
str = "1, too,\t3 , four"
search_hash.each { |n,v| str.gsub!(n, v) }
str #=> "one, two, 3 , four"
If you just want the spaces/tabs replaced with one space, why not just specify that as the replacement, and omit the whole hash?
gsub(/[ \t]+/, ' ')
UPDATE: based on your comment, you can use the block syntax of gsub
gsub(/[ \t]+/) { |match| ' ' } # build the replacement for each match inside the block