I need to extract all #usernames from a string (for Twitter) using Rails/Ruby:
String Examples:
"#tom #john how are you?"
"how are you #john?"
"#tom hi"
The function should extract all usernames from a string, leaving out any characters that are not allowed in usernames, such as the "?" in the second example.
From "Why can't I register certain usernames?":
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.
The \w metacharacter is equivalent to [a-zA-Z0-9_]:
/\w/ - A word character ([a-zA-Z0-9_])
Simply scanning for #\w+ will succeed according to that:
strings = [
"#tom #john how are you?",
"how are you #john?",
"#tom hi",
"#foo #_foo #foo_ #foo_bar #f123bar #f_123_bar"
]
strings.map { |s| s.scan(/#\w+/) }
# => [["#tom", "#john"],
# ["#john"],
# ["#tom"],
# ["#foo", "#_foo", "#foo_", "#foo_bar", "#f123bar", "#f_123_bar"]]
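If only the names are wanted, without the leading #, a capture group is one option. A minimal sketch follows; the method name extract_usernames is made up for illustration and is not part of any library:

```ruby
# Hypothetical helper: returns usernames without the leading '#'.
def extract_usernames(text)
  # With a capture group, String#scan returns one inner array per match,
  # so flatten the result into a plain list of names.
  text.scan(/#(\w+)/).flatten
end

extract_usernames("#tom #john how are you?") # => ["tom", "john"]
extract_usernames("how are you #john?")      # => ["john"]
extract_usernames("no handles here")         # => []
```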
There are multiple ways to do it - here's one way:
string = "#tom #john how are you?"
words = string.split " "
twitter_handles = words.select do |word|
word.start_with?('#') && word[1..-1].chars.all? do |char|
char =~ /[a-zA-Z0-9_]/
end && word.length > 1
end
The char =~ regex accepts only alphanumeric characters and the underscore.
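One thing to be aware of with the split-based approach is that it judges each whitespace-separated word as a whole, so a handle with trailing punctuation is rejected rather than trimmed. A small sketch (handles_by_split is a hypothetical name) that shows this:

```ruby
# Hypothetical wrapper around the split-based approach.
def handles_by_split(string)
  string.split(" ").select do |word|
    # Keep words that start with '#', have at least one more character,
    # and consist only of letters, digits and underscores after the '#'.
    word.length > 1 && word.start_with?('#') &&
      word[1..-1].chars.all? { |char| char =~ /[a-zA-Z0-9_]/ }
  end
end

handles_by_split("#tom #john how are you?") # => ["#tom", "#john"]
handles_by_split("how are you #john?")      # => [] ("#john?" fails as a whole word)
```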
r = /
\# # match a literal '#' (escaped because of free-spacing mode)
[[:alpha:]]+ # match one or more letters
\b # match word break
/x # free-spacing regex definition mode
"#tom #john how are you? And you, #andré?".scan(r)
#=> ["#tom", "#john", "#andré"]
If you wish to instead return
["tom", "john", "andré"]
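change the first line of the regex (the one that matches the # character) to
(?<=\#)
which is a positive lookbehind. It requires that the character "#" be present but it will not be part of the match. The # is escaped because an unescaped # starts a comment in free-spacing mode.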
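Put together, the lookbehind variant might look like the sketch below. The variable name names is only for illustration, and the # is written as \# so that free-spacing mode does not treat it as the start of a comment:

```ruby
r = /
  (?<=\#)       # require a preceding hash sign, without including it in the match
  [[:alpha:]]+  # match one or more letters, including non-ASCII ones
  \b            # match a word break
/x

names = "#tom #john how are you? And you, #andré?".scan(r)
# => ["tom", "john", "andré"]
```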
I have a string with exclamation marks. I want to remove the exclamation marks at the end of the word, not the ones before a word. Assume there is no exclamation mark by itself/ not accompanied by a word. By word I mean [a..z], can be uppercased.
For example:
exclamation("Hello world!!!")
#=> ("Hello world")
exclamation("!!Hello !world!")
#=> ("!!Hello !world")
I have read "How do I remove substring after a certain character in a string using Ruby?"; the two questions are close, but different.
def exclamation(s)
s.slice(0..(s.index(/\w/)))
end
# exclamation("Hola!") returns "Hol"
I have also tried s.gsub(/\w!+/, ''). Although it retains the '!' before the word, it removes both the last letter and the exclamation marks: exclamation("!Hola!!") #=> "!Hol".
How can I remove only the exclamation marks at the end?
If you would rather avoid regexes, which can be difficult to read, use this:
def exclamation(sentence)
words = sentence.split
words_wo_exclams = words.map do |word|
word.split('').reverse.drop_while { |c| c == '!' }.reverse.join
end
words_wo_exclams.join(' ')
end
Although you haven't given a lot of test data, here's an example of something that might work:
def exclamation(string)
string.gsub(/(\w+)!+(?=\s|\z)/, '\1')
end
The \s|\z part means either a space or the end of the string, and (?=...) means to just peek ahead in the string but not actually match against it.
Note that this won't work when the exclamation mark is followed by something other than whitespace or the end of the string (for example a comma or a closing quote right after "I'm mad!"), but you could always add that as another potential end-of-word match.
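As a rough sketch of that extension, the lookahead can accept any character that is neither a word character nor another "!", in addition to the end of the string. This is one possible variant under that assumption, not the only way to define the end of a word:

```ruby
# Sketch: strip a run of '!' that directly follows a word character,
# provided the whole run is followed by a non-word character or the string end.
def exclamation(string)
  string.gsub(/(?<=\w)!+(?=[^\w!]|\z)/, '')
end

exclamation("Hello world!!!")   # => "Hello world"
exclamation("!!Hello !world!")  # => "!!Hello !world"
exclamation("I'm mad!, really") # => "I'm mad, really"
```

Excluding "!" from the lookahead class keeps the regex from removing only part of a run that sits in the middle of a word, as in "a!!b".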
"!!Hello !world!, world!! I say".gsub(r, '')
#=> "!!Hello !world, world! I say"
where
r = /
(?<=[[:alpha:]]) # match an uppercase or lowercase letter in a positive lookbehind
! # match an exclamation mark
/x # free-spacing regex definition mode
or
r = /
[[:alpha:]] # match an uppercase or lowercase letter
\K # discard match so far
! # match an exclamation mark
/x # free-spacing regex definition mode
If the above example should return "!!Hello !world, world I say", change ! to !+ in the regexes.
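For reference, the !+ variant of the lookbehind regex would look something like this (the variable name result is only for illustration):

```ruby
r = /
  (?<=[[:alpha:]])  # positive lookbehind for an uppercase or lowercase letter
  !+                # match one or more exclamation marks
/x

result = "!!Hello !world!, world!! I say".gsub(r, '')
# => "!!Hello !world, world I say"
```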
I am looking to find method names for Python functions. I only want to find method names if they don't come right after "def ". E.g.:
"def method_name(a, b):" # (should not match)
"y = method_name(1,2)" # (should find `method_name`)
My current regex is /\W(.*?)\(/.
str = "def no_match(a, b):\ny = match(1,2)"
str.scan(/(?<!def)\s+\w+(?=\()/).map(&:strip)
#⇒ ["match"]
The regex comments:
negative lookbehind for def,
followed by spaces (will be stripped later),
followed by one or more word symbols \w,
followed by positive lookahead for parenthesis.
Sidenote: one should never use regexps to parse long strings for any purpose.
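A variant that does not depend on whitespace before the name, and so also picks up nested calls, could use a word boundary instead of \s+. This is only a sketch: it assumes exactly one space after "def", and CALL_NAME is a made-up constant name:

```ruby
# Sketch: find call sites, skipping names that directly follow "def ".
CALL_NAME = /(?<!def )\b\w+(?=\()/

source = "def no_match(a, b):\ny = match(1,2)\nz = outer(inner(3))"
calls = source.scan(CALL_NAME)
# => ["match", "outer", "inner"]
```

The \b keeps the regex from matching the tail of a skipped name (for example "o_match"), since there is no word boundary inside an identifier.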
I have assumed that lines that do not contain "def" are of the form "[something]=[zero or more spaces][method name]".
R1 = /
\bdef\b # match 'def' surrounded by word breaks
/x # free-spacing regex definition mode
R2 = /
[^=]+ # match any characters other than '='
= # match '='
\s* # match >= 0 whitespace chars
\K # forget everything matched so far
[a-z_] # match a lowercase letter or underscore
[a-z0-9_]* # match >= 0 lowercase letters, digits or underscores
[!?]? # possibly match '!' or '?'
/x
def match?(str)
(str !~ R1) && str[R2]
end
match?("def method_name1(a, b):") #=> false
match?("y = method_name2(1,2)") #=> "method_name2"
match?("y = method_name") #=> "method_name"
match?("y = method_name?") #=> "method_name?"
match?("y = def method_name") #=> false
match?("y << method_name") #=> nil
I chose to use two regexes to be able to deal with both my first and penultimate examples. Note that the method returns either a method name or a falsy value, but the latter may be either false or nil.
So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?
String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ will only match letters in the a-z range, meaning it won't match e.g. à. In Ruby versions before 2.4, String#upcase did not capitalize characters with diacritics anyway, because capitalization is language-dependent (e.g. i is capitalized differently in Turkish than in English). Since Ruby 2.4, upcase handles non-ASCII letters as well, so a Unicode-aware class such as [[:lower:]] can be used instead. Read this answer for more information: https://stackoverflow.com/a/4418681
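If names with accented letters need to be handled, a Unicode-aware character class is one possibility. The sketch below assumes Ruby 2.4 or later, where String#upcase maps non-ASCII letters; the sample name is invented:

```ruby
# Assumes Ruby 2.4+, where String#upcase also maps non-ASCII letters.
name  = "Jean-émile Bertaud-àlain"
fixed = name.gsub(/-[[:lower:]]/, &:upcase)
# => "Jean-Émile Bertaud-Àlain"
```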
"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"
I suggest you make the test more demanding by requiring the letter to be upcased to 1) be preceded by a capitalized word followed by a hyphen and 2) be followed by lowercase letters followed by a word break.
r = /
\b # Match a word break
[A-Z] # Match an upper-case letter
[a-z]+ # Match >= 1 lower-case letters
\- # Match hyphen
\K # Forget everything matched so far
[a-z] # Match a lower-case letter
(?= # Begin a positive lookahead
[a-z]+ # Match >= 1 lower-case letters
\b # Match a word break
) # End positive lookahead
/x # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"
How can I write a regex to match a string that does not start or end with a white space character? A matching string can have any character in the middle, and importantly, a single-character string should match.
My attempt was:
/\A\S.*\S\z/
but this will not match a single character.
This is one of the cases where you should not attempt to build a regex that matches something, but rather a regex that matches the complement of something, and use the regex negatively.
re = /\A\s|\s\z/
re !~ " " # => false
re !~ "" # => true
re !~ "sss" # => true
re !~ "s ss" # => true
re !~ " s ss" # => false
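The same idea can be wrapped in a predicate. A minimal sketch, where EDGE_WHITESPACE and trimmed? are hypothetical names and String#match? requires Ruby 2.4 or later:

```ruby
# Matches a string that starts or ends with a whitespace character.
EDGE_WHITESPACE = /\A\s|\s\z/

# Hypothetical predicate: true when the string has no whitespace at either edge.
def trimmed?(str)
  !str.match?(EDGE_WHITESPACE)
end

trimmed?("a")    # => true
trimmed?("a b")  # => true
trimmed?(" a")   # => false
trimmed?("a\n")  # => false
trimmed?("")     # => true
```

Note that the empty string counts as acceptable here, as in the re !~ "" example; add an explicit emptiness check if that is not wanted.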
is_ok = lambda do |str|
a, z = str.chars.first, str.chars.last
"#{a}#{z}" =~ / |\n|\t/ ? false : true
end
#"more elegant" (yeah dude I rock)
is_ok = lambda {|str| [0, -1].map{|i| str.chars[i] }.join =~ / |\n|\t/ ? false : true}
Use this regex:
\A\S+(?:\s*\S+)*\z
You can play with the Test String part of this demo to see how this works. I'm assuming that strings can span multiple lines, hence the \A and \z anchors rather than ^ and $. (\z is used instead of \Z because \Z would also accept a string that ends in a newline.)
In Ruby, something like:
if subject =~ /\A\S+(?:\s*\S+)*\z/
match = $&
end
Explanation
The \A anchor asserts that we are at the beginning of the subject string
\S+ matches one or more non-whitespace characters (whitespace includes spaces, tabs, newlines, etc.). Alternatively, if you want to allow newlines at the beginning and only want to exclude a space character, you can use [^ ]+ instead of \S+
(?:\s*\S+) matches any number of optional whitespace characters, followed by one or more non-whitespace characters
The * quantifier repeats that zero or more times
The \z anchor asserts that we are at the very end of the subject string
Use lookaheads, like this:
\A(?=\S).*\S\z
Regex101 Demo
This matches the start of the string and requires (1) that the first character be a non-whitespace character and (2) that the last character be a non-whitespace character.
Matches:
a
a b
a b c d 1231 e
Non matches:
(just a space)
a (leading space)
b (trailing space)
empty string
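The lists above can be checked with a few lines of Ruby. This sketch uses \z as the end anchor so that a trailing newline is not tolerated; the variable names are only for illustration:

```ruby
re = /\A(?=\S).*\S\z/

matches     = ["a", "a b", "a b c d 1231 e"]
non_matches = [" ", " a", "b ", ""]

all_match  = matches.all?      { |s| s.match?(re) } # => true
none_match = non_matches.none? { |s| s.match?(re) } # => true
```

Because . does not match a newline by default, a multi-line string would need the /m flag to match.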
Suppose I treat the £ character as dangerous, and I want to be able to protect any string (escape that character) and to unprotect it again.
Example 1:
"Foobar £ foobar foobar foobar." # => dangerous string
"Foobar \£ foobar foobar foobar." # => protected string
Example 2:
"Foobar £ foobar £££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \£\£\£\£\£\£\£foobar foobar." # => protected string
Example 3:
"Foobar \£ foobar \\£££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \\\£\£\£\£\£\£\£foobar foobar." # => protected string
Is there an easy way, with Ruby, to escape (and unescape) a given character (such as £ in my example) from a string?
Edit: here is an explanation of the behavior behind this question.
First of all, thanks for your answers. I have a Rails app with a Tweet model having a content field. Example of tweet:
tweet = Tweet.create(content: "Hello #bob")
Inside the model, there's a serialization process that converts the string like this:
dump('Hello #bob') # => '["Hello £", 42]'
# ... where 42 is the id of bob username
Then, I'm able to deserialize and display the tweet like this:
load('["Hello £", 42]') # => 'Hello #bob'
In the same way, it's also possible to do so with more than one username:
dump('Hello #bob and #joe!') # => '["Hello £ and £!", 42, 185]'
load('["Hello £ and £!", 42, 185]') # => 'Hello #bob and #joe!'
That's the goal :)
But this find-and-replace could be hard to perform with something like:
tweet = Tweet.create(content: "£ Hello #bob")
because here we also have to escape the £ character. I think your solution handles this well, so the result becomes:
dump('£ Hello #bob') # => '["\£ Hello £", 42]'
load('["\£ Hello £", 42]') # => '£ Hello #bob'
Just perfect. <3 <3
Now, if there is this:
tweet = Tweet.create(content: "\£ Hello #bob")
I think we first should escape every \, and then escape every £, like:
dump('\£ Hello #bob') # => '["\\£ Hello £", 42]'
load('["\\£ Hello £", 42]') # => '£ Hello #bob'
However, what should we do in this case:
tweet = Tweet.create(content: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\£ Hello #bob")
...where tweet.content.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\") does not seem to work.
Hopefully your version of Ruby supports lookbehinds. If it doesn't, my solution will not work for you.
Escape characters:
str = str.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\")
Un-escape characters:
str = str.gsub(/(?<!\\)((?:\\\\)*)\\£/, '\1£')
Both regexes will work regardless of the number of backslashes. They complement each other.
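A round trip with these two regexes might be sketched as follows. The names ESCAPE, UNESCAPE, protect and unprotect are invented for this example, and the block form of gsub is used so that no backslash processing happens in the replacement:

```ruby
ESCAPE   = /(?<!\\)(?=(?:\\\\)*£)/   # position before an unescaped £
UNESCAPE = /(?<!\\)((?:\\\\)*)\\£/   # an escaped £, with preceding backslash pairs captured

# Hypothetical helper: insert a backslash in front of each unescaped £.
def protect(str)
  str.gsub(ESCAPE) { "\\" }
end

# Hypothetical helper: drop the escaping backslash, keep any backslash pairs.
def unprotect(str)
  str.gsub(UNESCAPE) { "#{$1}£" }
end

s = "Foobar £ foobar £££foobar."
protect(s)            # => "Foobar \\£ foobar \\£\\£\\£foobar."
unprotect(protect(s)) # => "Foobar £ foobar £££foobar."
```

Since protect leaves an already escaped £ alone, the round trip is only the identity for input that does not already contain a backslash directly before a £.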
Escape explanation :
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
£ # Match the character “£” literally
)
"
Note that I am matching a certain position. No text is consumed at all. Once I have pinpointed the position I want, I insert a \.
Explanation of unescape :
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
( # Match the regular expression below and capture its match into backreference number 1
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
\\ # Match the character “\” literally
£ # Match the character “£” literally
"
Here I am saving all the backslashes minus one, and I replace the match with those saved backslashes followed by the special character. Tricky stuff :)
If you are using Ruby 1.9, which has lookbehind, then FailedDev's answer should work quite well. If you are using Ruby 1.8, which does not have lookbehind (I think), a different approach may work. Give this a try:
text.gsub!(/(\\.)|£/m) do
if ($1 != nil) # If escaped anything
$1 # replace with self.
else # Otherwise escape the
"\\£" # unescaped £.
end
end
Note that I am not a Ruby programmer and this snippet is untested (in particular I'm not sure if the: if ($1 != nil) statement usage is correct - it may need to be: if ($1 != "") or if ($1)), but I do know that this general technique (using code in place of a simple replacement string) works. I recently used this same technique for my JavaScript solution to a similar question which was looking to find unescaped asterisks.
I'm not sure if this is what you want, but I think you can do a simple find-and-replace:
str = str.gsub("£", "\\£") # to escape
str = str.gsub("\\£", "£") # to unescape
Note that I changed \ to \\ because you have to escape the backslash in a double-quoted string.
Edit: I think what you want is a regex that matches an odd number of backslashes:
str = str.gsub(/(^|[^\\])((?:\\\\)*)\\£/, "\\1\\2£")
That does the following transformations
"£" #=> "£"
"\\£" #=> "£"
"\\\\£" #=> "\\\\£"
"\\\\\\£" #=> "\\\\£"