Remove all non-alphabetical, non-numerical characters from a string? - ruby

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.

You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you wanted also punctuation, you can also build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡MarcAndré!"
For all character properties, you can refer to the doc.

string.gsub(/[^[:alnum:]]/, "")

The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.

You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using ruby since you tagged that in your post. You could go through the array, put it through a test using a regexp, and if it passes remove/keep it based on the regexp you use.
A regexp you might use might go something like this:
[^.!,^-#]
That will tell you if its not one of the characters inside the brackets. However, I suggest that you look up regular expressions, you might find a better solution once you know their syntax and usage.

If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.

Related

How to remove strings that end with a particular character in Ruby

Based on "How to Delete Strings that Start with Certain Characters in Ruby", I know that the way to remove a string that starts with the character "#" is:
email = email.gsub( /(?:\s|^)#.*/ , "") #removes strings that start with "#"
I want to also remove strings that end in ".". Inspired by "Difference between \A \z and ^ $ in Ruby regular expressions" I came up with:
email = email.gsub( /(?:\s|$).*\./ , "")
Basically I used gsub to remove the dollar sign for the carrot and reversed the order of the part after the closing parentheses (making sure to escape the period). However, it is not doing the trick.
An example I'd like to match and remove is:
"a8&23q2aas."
You were so close.
email = email.gsub( /.*\.\s*$/ , "")
The difference lies in the fact that you didn't consider the relationship between string of reference and the regex tokens that describe the condition you wish to trigger. Here, you are trying to find a period (\.) which is followed only by whitespace (\s) or the end of the line ($). I would read the regex above as "Any characters of any length followed by a period, followed by any amount of whitespace, followed by the end of the line."
As commenters pointed out, though, there's a simpler way: String#end_with?.
I'd use:
words = %w[#a day in the life.]
# => ["#a", "day", "in", "the", "life."]
words.reject { |w| w.start_with?('#') || w.end_with?('.') }
# => ["day", "in", "the"]
Using a regex is overkill for this if you're only concerned with the starting or ending character, and, in fact, regular expressions will slow your code in comparison with using the built-in methods.
I would really like to stick to using gsub....
gsub is the wrong way to remove an element from an array. It could be used to turn the string into an empty string, but that won't remove that element from the array.
def replace_suffix(str,suffix)
str.end_with?(suffix)? str[0, str.length - suffix.length] : str
end

How in ruby delete all non-digits symbols (except commas and dashes)

I meet some hard task for me. I has a string which need to parse into array and some other elements. I have a troubles with REGEXP so wanna ask help.
I need delete from string all non-digits, except commas (,) and dashes (-)
For example:
"!1,2e,3,6..-10" => "1,2,3,6-10"
"ffff5-10...." => "5-10"
"1.2,15" => "12,15"
and so.
[^0-9,-]+
This should do it for you.Replace by empty string.See demo.
https://regex101.com/r/vV1wW6/44
We must have at least one non-regex solution:
def keep_some(str, keepers)
str.delete(str.delete(keepers))
end
keep_some("!1,2e,3,6..-10", "0123456789,-")
#=> "1,2,3,6-10"
keep_some("ffff5-10....", "0123456789,-")
#=> "5-10"
keep_some("1.2,15", "0123456789,-")
#=> "12,15"
"!1,2e,3,6..-10".gsub(/[^\d,-]+/, '') # => "1,2,3,6-10"
Use String#gsub with a pattern that matches everything except what you want to keep, and replace it with the empty string. In a reguar expression, the negated character class [^whatever] matches everything except the characters in the "whatever", so this works:
a_string.gsub /[^0-9,-]/, ''
Note that the hyphen has to come last, as otherwise it will be interpreted as a range indicator.
To demonstrate, I put all your "before" strings into an Array and used Enumerable#map to run the above gsub call on all of them, producing an Array of the "after" strings:
["!1,2e,3,6..-10", "ffff5-10....", "1.2,15"].map { |s| s.gsub /[^0-9,-]/, '' }
# => ["1,2,3,6-10", "5-10", "12,15"]

Drop elements from array if regexp does not match

Is there a way to do this?
I have an array:
["file_1.jar", "file_2.jar","file_3.pom"]
And I want to keep only "file_3.pom", what I want to do is something like this:
array.drop_while{|f| /.pom/.match(f)}
But This way I keep everything in array but "file_3.pom" is there a way to do something like "not_match"?
I found these:
f !~ /.pom/ # => leaves all elements in array
OR
f !~ /*.pom/ # => leaves all elements in array
But none of those returns what I expect.
How about select?
selected = array.select { |f| /.pom/.match(f) }
p selected
# => ["file_3.pom"]
Hope that helps!
In your case you can use the Enumerable#grep method to get an array of the elements that matches a pattern:
["file_1.jar", "file_2.jar", "file_3.pom"].grep(/\.pom\z/)
# => ["file_3.pom"]
As you can see I've also slightly modified your regular expression to actually match only strings that ends with .pom:
\. matches a literal dot, without the \ it matches any character
\z anchor the pattern to the end of the string, without it the pattern would match .pom everywhere in the string.
Since you are searching for a literal string you can also avoid regular expression altogether, for example using the methods String#end_with? and Array#select:
["file_1.jar", "file_2.jar", "file_3.pom"].select { |s| s.end_with?('.pom') }
# => ["file_3.pom"]
If you whant to keep only Strings witch responds on regexp so you can use Ruby method keep_if.
But this methods "destroy" main Array.
a = ["file_1.jar", "file_2.jar","file_3.pom"]
a.keep_if{|file_name| /.pom/.match(file_name)}
p a
# => ["file_3.pom"]

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special think is the lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by #georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /(?:act|atc|cat|cta|tac|tca)/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; I works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my #set = $perm->next) {
$regex_assembler->add(join('', #set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like #scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
Regexp.union "act".split('').permutation.map(&:join)
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on #theTinMan's feedback.

Eloquent way to format string in Ruby?

str = 'foo_bar baz __goo'
Should print as
Foo Bar Baz Goo
Tried to use split /(\s|_)/, but it returns '_' and ' ' and multiple spaces...?
Ruby 1.9.3
try this:
str.split(/[\s_]+/).map(&:classify).join(" ")
if you have access to active support, or
str.split(/[\s_]+/).map(&:capitalize).join(" ")
if you want plain ruby.
Assuming you mean /(\s|_)/ (direction of the slashes matters!), your regular expression is pretty close. The reason you're getting the delimiters (spaces and underscores) in your result is the parentheses: they instruct the splitter to include the delimiters in the returned array.
The reason you are getting extra empty strings is that you are splitting on \s, which matches exactly one space (or tab), or '_', which matches exactly one underscore. If you want to treat any number of spaces or underscores as a single delimiter, you need to add + to your regex - it means "one or more of the previous thing".
But \s|_+ means "a space, or one or more underscores". You want to apply the + to the whole expression, not just the _. That brings us back to the parentheses. In this case, you want to group the two alternatives together without capturing (and returning) them; the syntax for that is (?:...). So this is the result:
str.split(/(?:\s|_)+/)
Now, if you want to normalize case, you want to run capitalize on each string, which you can do with map, like this:
str.split(/(?:\s|_)+/).map { |s| s.capitalize }
or use the shortcut:
str.split(/(?:\s|_)+/).map(&:capitalize)
So far, all these solutions return an array of strings, which you can do a variety of things with. But if you just want to put them back together into a single string, you can use join. For instance, to put them together with a single space between them:
str.split(/(?:\s|_)+/).map(&:capitalize).join ' '
Try splitting the string on:
[\s_]+
Use /[\ _]+/, it looks for one or more occurrence or space or underscore. In that way it is able to eat out multiple underscores, spaces or a combination of both. After than you get an array, so you use map to transform each of them. Later you can get them together using join. See examples -
Get them in a list -
str.split(/[\ _]+/).map {|s| s.capitalize }
=> ["Foo", "Bar", "Baz", "Goo"]
Get them as a whole string -
str.split(/[\ _]+/).map {|s| s.capitalize }.join(" ")
=> "Foo Bar Baz Goo"
Here the pure Ruby version:
str = 'foo_bar baz __goo'
str.split(/[ _]+/).map{|s| s[0].upcase+s[1..-1]}.join(" ")

Resources