Regex to match all alphanumeric hashtags, no symbols - ruby

I am writing a hashtag scraper for Facebook, and every regex I come across to get hashtags seems to include punctuation as well as alphanumeric characters. Here's an example of what I would like:
Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression.
I would like it to match world, m4king, fac and expression (note that I would like it to cut off if it reaches punctuation, including spaces). It would be nice if it didn't include the hash symbol, but it's not super important.
Just in case it's important, I will be using Ruby's String#scan method to grab possibly more than one tag.
Thanks heaps in advance!

A regex such as this: #([A-Za-z0-9]+) should match what you need and place it in a capture group. You can then access this group later. Maybe this will help shed some light on regular expressions (from a Ruby context).
The regex above will start matching when it finds a # tag and will throw any following letters or numbers into a capture group. Once it finds anything which is not a letter or a digit, it will stop the matching. In the end you will end up with a group containing what you are after.

str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#([A-Za-z0-9]+)/).flatten #=> ["world", "m4king", "fac", "expression"]
The call to #flatten is needed because, when the pattern contains a capture group, scan returns one array of captures per match.
Alternatively, you can use look-behind matching which will match alphanumeric characters only after a '#':
str.scan /(?<=#)[[:alnum:]]+/ #=> ["world", "m4king", "fac", "expression"]
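One difference between the two patterns above only shows up with non-ASCII text: in Ruby 1.9 and later, POSIX bracket classes such as [[:alnum:]] also match accented and other Unicode letters, while [A-Za-z0-9] stops at the first one. A small sketch, assuming a UTF-8 string (the helper names and sample text are made up for illustration):

```ruby
# Hypothetical helpers comparing the two patterns from the answer above.
def ascii_tags(text)
  # Capture-group form: ASCII letters and digits only.
  text.scan(/#([A-Za-z0-9]+)/).flatten
end

def unicode_tags(text)
  # Look-behind form: [[:alnum:]] is Unicode-aware on UTF-8 strings.
  text.scan(/(?<=#)[[:alnum:]]+/)
end

# "\u00e9" is an accented e, written as an escape to keep the source ASCII-only.
sample = "Un #caf\u00e9 et un #r\u00e9sum\u00e92"
p ascii_tags(sample)    # the tags are cut off at the first accented letter
p unicode_tags(sample)  # the tags are returned whole
```

Which behaviour you want depends on whether the hashtags you scrape are expected to be ASCII-only.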

Here's a simpler regex: /#[[:alnum:]_]+/. Note that it includes underscores, because Facebook currently treats underscores as part of hashtags (as does Twitter).
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#[[:alnum:]_]+/) #=> ["#world", "#m4king", "#fac_book", "#expression"]
Here's a view on Rubular:
http://rubular.com/r/XPPqwtVGN9
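If you want the underscore-aware match but without the leading hash symbol, a capture group can be combined with this character class. A minimal sketch (the method name hashtags is just an illustrative choice):

```ruby
# Returns hashtag names without the leading '#'; underscores count as part of the tag.
def hashtags(text)
  text.scan(/#([[:alnum:]_]+)/).flatten
end

str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
p hashtags(str) #=> ["world", "m4king", "fac_book", "expression"]
```

Note that this keeps fac_book whole, which differs from what the question asked for (fac), so pick whichever rule matches the site you are scraping.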

Related

Can someone help me with Ruby regex to check any word with letters starting with t and ending with r and replace with word Twitter? Thank you

I find that Rubular is very useful for working out how regexes work in Ruby.
You have two questions here. First, what regex will recognise what you want. Second, how to replace that found string with something else.
Your regex will be something like /\bt\w*r\b/. The elements here are \b, which is a word boundary. Then, we have the letter t, then any number of word characters \w*, then the letter r, and finally another word boundary \b. (Without the word-boundary characters, your regex would find t...r inside other words too, so it would also match parts of words like 'stress' and 'stirs'.)
To do the replacement you want the gsub method.
new_string = your_string.gsub(/\bt\w*r\b/i, 'Twitter')
This will substitute the string Twitter for the found regex. The i on the end of the regex makes it case-insensitive - omit this if you want it to only find the lower-case text as in the regex.
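To see what the word boundaries and the i flag each contribute, here is a small sketch with a made-up sentence (the method names are hypothetical, chosen only for this comparison):

```ruby
# Case-insensitive, whole words only (the version suggested above).
def twitterize(text)
  text.gsub(/\bt\w*r\b/i, 'Twitter')
end

# Same pattern without the i flag: "Tailor" is left alone.
def twitterize_case_sensitive(text)
  text.gsub(/\bt\w*r\b/, 'Twitter')
end

# No word boundaries: also rewrites the "tir" inside "stirs".
def twitterize_unanchored(text)
  text.gsub(/t\w*r/, 'Twitter')
end

text = "A tiger and a Tailor met the trader; stirs stay."
puts twitterize(text)                 # A Twitter and a Twitter met the Twitter; stirs stay.
puts twitterize_case_sensitive(text)  # A Twitter and a Tailor met the Twitter; stirs stay.
puts twitterize_unanchored(text)      # A Twitter and a Tailor met the Twitter; sTwitters stay.
```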

Working with Ruby class: Capitalizing a string

I'm trying to get my head around how to work with Classes in Ruby and would really appreciate some insight on this area. Currently, I've got a rather simple task to convert a string with the start of each word capitalized. For example:
Not Jaden-Cased: "How can mirrors be real if our eyes aren't real"
Jaden-Cased: "How Can Mirrors Be Real If Our Eyes Aren't Real"
This is my code currently:
class String
  def toJadenCase
    split
    capitalize
  end
end
#=> usual case: split.map(&:capitalize).join(' ')
Output:
Expected: "The Moment That Truth Is Organized It Becomes A Lie.",
instead got: "The moment that truth is organized it becomes a lie."
I suggest you not pollute the core String class with the addition of an instance method. Instead, just add an argument to the method to hold the string. You can do that as follows, by downcasing the string then using gsub with a regular expression.
def to_jaden_case(str)
  str.downcase.gsub(/(?<=\A| )[a-z]/) { |c| c.upcase }
end
to_jaden_case "The moMent That trUth is organized, it becomes a lie."
#=> "The Moment That Truth Is Organized, It Becomes A Lie."
Ruby's regex engine performs the following operations:
(?<=\A| ) : use a positive lookbehind to assert that the following match is immediately preceded by the start of the string or a space
[a-z] : match a lowercase letter
(?<=\A| ) can be replaced with the negative lookbehind (?<![^ ]), which asserts that the match is not preceded by a character other than a space.
Notice that by using String#gsub with a regular expression (unlike the split-process-join dance), extra spaces are preserved.
When spaces are to be matched by a regular expression one often sees whitespaces (\s) matched instead. Here, for example, /(?<=\A|\s)[a-z]/ works fine, but sometimes matching whitespaces leads to problems, mainly because they also match newlines (\n) (as well as spaces, tabs and a few other characters). My advice is to match space characters if spaces are to be matched. If tabs are to be matched as well, use a character class ([ \t]).
Try:
def toJadenCase
  self.split.map(&:capitalize).join(' ')
end
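The two approaches above differ in how they treat repeated spaces. A quick sketch that puts them side by side as plain methods (jaden_gsub and jaden_split are hypothetical names, and the input string is invented):

```ruby
# gsub-based version: capitalizes in place, so spacing is untouched.
def jaden_gsub(str)
  str.downcase.gsub(/(?<=\A| )[a-z]/) { |c| c.upcase }
end

# split/map/join version: split drops the original spacing,
# and join(' ') puts back exactly one space between words.
def jaden_split(str)
  str.split.map(&:capitalize).join(' ')
end

text = "how  can mirrors be real"  # note the two spaces after "how"
p jaden_gsub(text)   #=> "How  Can Mirrors Be Real"
p jaden_split(text)  #=> "How Can Mirrors Be Real"
```

If the input is known to be single-spaced, both give the same result and the split version is arguably easier to read.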

Extract a word from a sentence in Ruby

I have the following string:
str = "XXX host:1233455 YYY ZZZ!"
I want to extract the value after host: from this string.
Is there any optimal way in Ruby to do this using RegExp, avoiding multiple loops?
Any solution is welcome.
If you have numbers, use the following regex:
(?<=host:)\d+
The lookbehind will find the numbers right after host:.
See IDEONE demo:
str = "XXX host:1233455 YYY ZZZ!"
puts str.match(/(?<=host:)\d+/)
Note that if you want to match alphanumerics and not any punctuation, you can replace \d+ with \w+.
Also, if the value can contain dots or commas, you can use
/(?<=host:)\d+(?:[.,]\d+)*/
It will extract values like 4,445 or 44.45.455.
UPDATE:
In case you need a more universal solution (especially if you need to use the regex on another platform where look-behind is not supported, as in JavaScript), use the capture group approach:
str.match(/\bhost:(\d+)/).captures.first
Note that \b makes sure we find host: as a whole word, not localhost:. (\d+) is the capture group whose value we can refer to with the backreferences, or via .captures.first in Ruby.
str[/host:(\d+)/, 1]
# => "1233455"
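One practical difference between the forms above: str.match(...) returns nil when nothing matches, so calling .captures on it raises NoMethodError, whereas String#[] with a regex and a group index simply returns nil. A minimal sketch (host_value is a hypothetical helper name):

```ruby
# Returns the digits after "host:" as a String, or nil if there is no such field.
# The \b keeps "localhost:" from being mistaken for "host:".
def host_value(text)
  text[/\bhost:(\d+)/, 1]
end

p host_value("XXX host:1233455 YYY ZZZ!")  #=> "1233455"
p host_value("XXX localhost:8080 YYY")     #=> nil
p host_value("no host field here")         #=> nil
```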
What about the regex:
host:(\S+)
Here you can find a demo.
You can capture the value for example.
str.match(/host:(\d+)/).captures.first

Regex Match Until Word Contained in Array

Using Ruby 1.8.7
I need to grab everything up to a certain word - and I would like to match against words in an array. Example:
match_words = ['title','author','pages']
item = "Title: Jurassic Park\n"
item += "Author: Michael Crichton\n"
if item =~ /title: (.*)#{match any word in match_words array}/i
#do something here
end
So, this would ideally return "Jurassic Park\n". I am currently matching on newlines but have found that the data I will be matching against might have newlines in strange places, like the middle of the sentence. So, I think matching to the next match_word would be a good idea.
Is this possible, or maybe can be done another way?
Try this on for size
item.scan(/(title|author|pages):\s*?(.+)/i)
What this says is: find all the results that start (case-insensitively) with either title, author or pages, followed by a colon, optional whitespace, and then more characters. Capture the label and then the characters that follow it. The scan method will match as many times as it can.
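One detail to watch in that pattern: because \s*? is lazy and .+ is happy to match a space, the leading space ends up inside the captured value. Making the whitespace match greedy avoids that. A small sketch comparing the two (method names are illustrative only):

```ruby
# Lazy whitespace, as in the pattern above: the space after the colon is captured.
def fields_lazy(text)
  text.scan(/(title|author|pages):\s*?(.+)/i)
end

# Greedy whitespace: the space is consumed before the capture starts.
def fields_greedy(text)
  text.scan(/(title|author|pages):\s*(.+)/i)
end

item = "Title: Jurassic Park\nAuthor: Michael Crichton\n"
p fields_lazy(item)    #=> [["Title", " Jurassic Park"], ["Author", " Michael Crichton"]]
p fields_greedy(item)  #=> [["Title", "Jurassic Park"], ["Author", "Michael Crichton"]]
```

Either way, . does not match a newline without the /m flag, so values that wrap onto a second line are cut off at the line break.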
Just iterate over the match words and do the regex compare as you normally would.
match_words.each do |word|
  if item =~ /#{word}/ # Plus case sensitivity, start/end of item, etc.
    # etc.
  end
end
But if you know that the things you care about are at the beginning of the lines, then split the input string on \n and just use start_with? instead of bothering with the regex. That partially depends on what the real data looks like.
First, create a | separated list of keywords from match_words.
Then, use string.scan to split the string apart, giving you an array of arrays with your results. See the end of this tutorial for a reference.
Here's my best shot:
keywords = match_words.join('|')
results = item.scan(/(#{keywords}):\s*(.+?)\s*(?=(?:#{keywords}):|\z)/im)
Results: [["Title", "Jurassic Park"], ["Author", "Michael Crichton"]]
Don't forget to use the /m switch to indicate that you want . to match newlines.
To explain the pattern: we look for a keyword, then use a "look ahead" (?= ) to find the next keyword (or the end of the string, \z) without consuming it. The keyword group inside the look-ahead is non-capturing, (?: ), so each result holds only the label and its value. We capture all characters in between using a "lazy" expression .+?, so that we don't capture other keywords.
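Here is a sketch of the keyword-to-keyword idea wrapped in a helper that returns a hash, using input where the title is broken across two lines. The name parse_fields is hypothetical, Regexp.escape is added as a precaution in case a keyword ever contains regex metacharacters, and each_with_object assumes Ruby 1.9 or later (on 1.8.7 you would build the hash with inject instead):

```ruby
# Splits "Label: value" text into a hash, where each value runs until the
# next known label (or the end of the string), newlines included.
def parse_fields(text, words)
  keywords = words.map { |w| Regexp.escape(w) }.join('|')
  pattern = /(#{keywords}):\s*(.+?)\s*(?=(?:#{keywords}):|\z)/im
  text.scan(pattern).each_with_object({}) do |(label, value), fields|
    fields[label.downcase] = value
  end
end

match_words = %w[title author pages]
item = "Title: Jurassic\nPark\nAuthor: Michael Crichton\n"
p parse_fields(item, match_words)
#=> {"title"=>"Jurassic\nPark", "author"=>"Michael Crichton"}
```

The stray newline inside the title is kept as part of the value; whether to replace it with a space afterwards depends on your data.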

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using Ruby, so that would look like:
matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[[:alpha:]] matches an alphabetic character, upper or lowercase.
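Since the question asks for a function that takes the letter as input, the scan approach can be wrapped like this. It is only a sketch: words_starting_with is a made-up name, and Regexp.escape is there in case the "letter" is ever a character with special meaning in a regex.

```ruby
# Returns every word in the sentence that starts with the given letter,
# ignoring case. Only alphabetic characters are treated as part of a word.
def words_starting_with(sentence, letter)
  sentence.scan(/\b#{Regexp.escape(letter)}[[:alpha:]]*/i)
end

s = "I think Paris in the spring is a beautiful place"
p words_starting_with(s, "p")  #=> ["Paris", "place"]
p words_starting_with(s, "t")  #=> ["think", "the"]
```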
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character: a letter, a digit, or an underscore.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parentheses, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this; there's a lot of variation in how global searches return results, sadly.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one match. It's still restricted to a single line in some languages and libraries, though, so if you wish to check an entire file you'll sometimes need /gm to make it multi-line.
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
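Because \b is zero-width, both captures are empty, so this has the same effect as string.gsub(/\ba\w*\b/, ""); the surrounding non-word characters are left in place either way. To substitute something instead, put the replacement text between \1 and \2. Use gsub to replace every match, sub for only the first.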
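A short sketch of the removal in Ruby, on an invented sentence. It also shows that the version with captured \b groups gives the same string as the plain one, and adds an optional cleanup step for the spaces left behind (both method names are hypothetical):

```ruby
# Removes every word starting with a lowercase "a"; surrounding spaces remain.
def remove_a_words(text)
  text.gsub(/\ba\w*\b/, "")
end

# Same removal, then collapse runs of spaces and trim the ends.
# Note: squeeze also collapses any double spaces that were in the input.
def remove_a_words_tidy(text)
  remove_a_words(text).squeeze(" ").strip
end

text = "Bob ate an apple today"
p remove_a_words(text)                        #=> "Bob    today"
p text.gsub(/(\b)a\w*(\b)/, "\\1\\2")         #=> "Bob    today"
p remove_a_words_tidy(text)                   #=> "Bob today"
```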
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally I think that a regex is overkill for this application; simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or, if you are determined to use a regex, go with grep:
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
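One thing to keep in mind when choosing between splitting on spaces and scanning with \b: split keeps any punctuation attached to a word, while scan with \w stops at it. A small sketch with a made-up sentence (method names are illustrative):

```ruby
# Split on whitespace, then keep the pieces that start with "t".
def t_words_by_split(sentence)
  sentence.split.grep(/\At/)
end

# Scan for "t" at a word boundary followed by word characters.
def t_words_by_scan(sentence)
  sentence.scan(/\bt\w*/)
end

sentence = "this is a test, truly"
p t_words_by_split(sentence)  #=> ["this", "test,", "truly"]
p t_words_by_scan(sentence)   #=> ["this", "test", "truly"]
```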
