STILL NOT RESOLVED :( [Feb 11th]
I have a large text file full of random data and want to pull out all the email addresses from it.
I would like to do this in Ruby, with pseudo code like this:
monster_data_string = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)
Does anyone know what Ruby email regular expression I would use to accomplish this?
Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.
What I have already tried is:
monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
but I receive Ruby errors stating that "+" is an invalid character
Thanks in advance
Watch this...
require 'yaml'

# Read the whole file and pull out anything that looks like an email address
content = File.read("content.txt")
r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)
If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:
/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i
For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
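A small sketch of the difference, using the regex above on a made-up string:
text  = "a joe@example.com b sue@example.org c"
regex = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
text.match(regex)[0]   # => "joe@example.com"  (first match only)
text.scan(regex)       # => ["joe@example.com", "sue@example.org"]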
Given that it is not possible to parse every valid email address using a regexp, you are left with two choices:
Make a regexp that matches as many valid email addresses as possible, and live with the fact that some valid but rarely used forms of email address might get overlooked.
or
Make a regexp that matches anything that "might be" an email address, and then live with the false positives.
I use the second approach to weed out obviously wrong addresses when validating user sign-up email addresses on a web page.
Gleaned from Ruby Cookbook which has a very good section on email address validation:
valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/
Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
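As an illustration of the second approach, here is a minimal sketch using the loose Cookbook-style pattern above; the method name and the sample inputs are mine:
def plausible_email?(address)
  valid = '[^ @]+'
  # loose check: anything@anything.anything counts as "might be" an email
  !!(address =~ /^#{valid}@#{valid}\.#{valid}/)
end

plausible_email?("joe@example.com")  # => true
plausible_email?("not an email")     # => false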
What kind of runtime error messages are you getting? Is it treating the regexps as invalid, or is it breaking because the target string is too large?
To try and help you get there (though not very elegantly, I admit):
I think the start and end anchors (^ and $) aren't helping. You may also want to filter out the asterisks:
irb(main):001:0> mds = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
=> "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
=> nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "joe@example.com" 1:"joe" 2:"example.com">
Even better,
require 'yaml'
content = "asfsfsdfsdfsf sfda **joe@example.com.au** sdfdsf cool_me@example.com.fr"
r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
will give you
---
- - joe
- example
- .com.au
- - cool_me
- example
- .com.fr
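If the goal is whole addresses rather than the captured pieces, the tuples can be rejoined afterwards; a small sketch on top of the output above:
emails.map { |user, domain, tld| "#{user}@#{domain}#{tld}" }
# => ["joe@example.com.au", "cool_me@example.com.fr"]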
So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else, and tried to edit it to work and I'm failing. One of the big problems I'm having is the original Regex I took had a few "#"'s in it, which I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
Is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend this one, Rubular.com is good for experimenting with these.
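For instance, a rough sketch of filtering an already-collected list of links with that pattern; the sample array is made up:
image_regex = /^http(s?):\/\/.*\.(jpeg|jpg|gif|png)/
uris = ["http://example.com/a.png", "http://example.com/page.html"]  # sample data
image_uris = uris.grep(image_regex)
# => ["http://example.com/a.png"]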
Regexps are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
# ...
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
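A rough usage sketch of that version; the HTML fragment is made up:
require 'nokogiri'

html = '<a href="http://example.com/cat.png">cat</a> <a href="/about">about</a>'
parse_html(html)  # => ["http://example.com/cat.png"]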
I have the following path:
http://192.168.56.10:4567/browse/foo/bar?x=100&y=200
I want absolutely everything that comes after "http://192.168.56.10:4567/browse/" in a string.
Using a splat doesn't work (only catches "foo/bar"):
get '/browse/*' do
Neither does the regular expression (also only catches "foo/bar"):
get %r{/browse/(.*)} do
The x and y params are all accessible in the params hash, but doing a .map on the ones I want seems unreasonable and un-ruby-like (also, this is just an example.. my params are actually very dynamic and numerous). Is there a better way to do this?
More info: my path looks this way because it is communicating with an API and I use the route to determine the API call I will make. I need the string to look this way.
If you are willing to ignore a hash fragment in the path, this should work (the browser drops anything after a hash in the URL, so it never reaches the server anyway).
Updated answer:
get "/browse/*" do
p "#{request.path}?#{request.query_string}".split("browse/")[1]
end
Or even simpler
request.fullpath.split("browse/")[1]
get "/browse/*" do
a = "#{params[:splat]}?#{request.env['rack.request.query_string']}"
"Got #{a}"
end
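For the example URL in the question, both variants should produce the same string; a quick sketch, with the values I'd expect as comments:
# inside the get "/browse/*" block:
request.fullpath                        # => "/browse/foo/bar?x=100&y=200"
request.fullpath.split("browse/")[1]    # => "foo/bar?x=100&y=200"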
I'm scanning names and descriptions of different items in order to see if there are any keyword matches.
In the code below it will return things like 'googler' or 'applecobbler', when what I'm trying to do is get exact matches only:
[name, description].join(" ").downcase.scan(/apple|microsoft|google/)
How should I do this?
My regex skills are pretty weak, but I think you need to use a word boundary:
[name, description].join(" ").downcase.scan(/\b(apple|microsoft|google)\b/)
Rubular example
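A quick check of the word-boundary version on a hypothetical string:
"google googler applecobbler apple".scan(/\b(apple|microsoft|google)\b/)
# => [["google"], ["apple"]]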
It depends on what information you want, but if you just want exact matches, you do not need a regex for the comparison part. Just compare the relevant strings.
splitted_strings = [name, description].join(" ").downcase.split(/\b/)
splitted_strings & %w[apple microsoft google]
# => the matching words, in the order of appearance
Add proper word-boundary anchors (\b) to your regexp. You can also use the #grep method instead of joining:
array.grep(your_regexp)
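For example, a small sketch of #grep with a word-boundary pattern; the array contents are made up:
keywords = /\b(?:apple|microsoft|google)\b/i
["From: Apple", "applecobbler recipes", "Google results"].grep(keywords)
# => ["From: Apple", "Google results"]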
Looking at the question, and at the situation where I'd want to do these things in an actual program, where I had lists of sources and their associated texts and wanted to know the hits, I'd probably write something like this:
require 'pp'
names = ['From: Apple', 'From: Microsoft', 'From: Google.com']
descriptions = [
'"an apple a day..."',
'Microsoft Excel flight simulator... according to Microsoft',
'Searches of Google revealed multiple hits for "google"'
]
targets = %w[apple microsoft google]
regex = /\b(?:#{ Regexp.union(targets).source })\b/i
names.zip(descriptions) do |n,d|
name_hits, description_hits = [n, d].map{ |s| s.scan(regex) }
pp [name_hits, description_hits]
end
Which outputs:
[["Apple"], ["apple"]]
[["Microsoft"], ["Microsoft", "Microsoft"]]
[["Google"], ["Google", "google"]]
This would let me know the letter-case of the words, so I could try to differentiate the apple fruit from Apple the company, and get word counts, helping to show relevance of the text.
The regex looks like:
/\b(?:apple|microsoft|google)\b/i
It's case-insensitive, but scan returns the words in their original case.
names, descriptions and targets could all come from a database or from separate files, which helps separate the data from the code and avoids the need to modify the code as the targets change.
I'd use a list of target words and use Regexp.union to quickly build the pattern.
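For reference, Regexp.union on its own builds just the alternation part of that pattern:
Regexp.union(%w[apple microsoft google])
# => /apple|microsoft|google/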
This may be a very simple question, but I am very new to Ruby (or to any programming language). I want to trim some text and use it as a parameter for the next step. Can anyone please write me code for doing this? I am testing a web application used in the financial domain. I need to use the CVV2 and the expiry date of the card number as parameters in the next step. The text which gets displayed in the HTML is:
CVV2 - 657 Expiry - 05/12 (mm/yy)
Now, from the above text, I should somehow get only '657' and '0512' as values to use in the next step.
Request for urgent assistance.
If all your strings will be formatted like this, I suggest using a regexp (for example with String#scan or String#gsub). There are lots of places to learn about regexps if you don't know them already, and Rubular lets you test them in your browser, with a short cheat sheet.
To do it quickly you could use
card_details = "CVV2 - 657 Expiry - 05/12 (mm/yy)"
card_details = card_details.scan(/\d{2,}/)
cvv2 = card_details[0]
expiry = card_details[1] + card_details[2]
There are probably better ways of doing it, as I'm no expert, but you said urgent, so.
For getting the text out of the cell you could try (I don't use the original watir anymore, so I might not be able to remember this):
card_details = browser.td(:text => /CVV2/).text
If that doesn't work give this a try (actually on second thought TRY THIS ONE FIRST)
card_details = browser.cell(:text => /CVV2/).text
For these examples I'm assuming your browser object is called "browser".
We can use regular expressions to achieve the same:
> "CVV2 - 657 Expiry - 05/12 (mm/yy)".match(/\d{3}/)[0]
=> "657"
> "CVV2 - 657 Expiry - 05/12 (mm/yy)".match(/\d+\/\d+/)[0]
=> "05/12"
As part of a chat app I'm writing, I need to use regular expressions to match asterisks and underscores in chat messages and turn them into <strong> and <em> tags. Since I'm terrible with regex, I'm really stuck here. Ideally, we would have it set up such that:
One to three words, but not more, can be marked for strong/em.
Patterns such as "un*believ*able" would be matched.
Only one or the other (strong OR em) work within one line.
The above parameters are in order of importance, with only #1 being utterly necessary - the others are just prettiness. The closest I came to anything that worked was:
text = text.sub(/\*([(0-9a-zA-Z).*])\*/,'<b>\1<\/b>')
text = text.sub(/_([(0-9a-zA-Z).*])_/,'<i>\1<\/i>')
But it obviously doesn't work with any of our params.
It's odd that there's not an example of something similar already out there, given the popularity of using asterisks for bold and whatnot. If there is, I couldn't find it outside of plugins/gems (which won't work for this instance, as I really only need it in one place in my model). Any help would be appreciated.
This should help you finish what you are doing:
sub(/\*(.*)\*/,'<b>\1</b>')
sub(/_(.*)_/,'<i>\1</i>')
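A quick check of those substitutions on a sample message of my own:
text = "this is *really* cool"
text.sub(/\*(.*)\*/, '<b>\1</b>')   # => "this is <b>really</b> cool"
text.sub(/_(.*)_/, '<i>\1</i>')     # => "this is *really* cool" (no underscores, so unchanged)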
Firstly, your criteria are a little strange, but, okay...
It seems that a possible algorithm for this would be to find a marked span in the message, count the words inside it to check that there are fewer than four, and then perform one set of substitutions.
STRONG_REGEXP = /\*([^\*]*)\*/
EM_REGEXP = /_([^_]*)_/

def process(input)
  if input =~ STRONG_REGEXP && input.match(STRONG_REGEXP)[1].split.size < 4
    input.sub STRONG_REGEXP, '<b>\1</b>'
  elsif input =~ EM_REGEXP && input.match(EM_REGEXP)[1].split.size < 4
    input.sub EM_REGEXP, '<i>\1</i>'
  end
end
Your specifications aren't entirely clear, but if you understand this, you can tweak it yourself.
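A rough usage sketch of the method above, with made-up inputs:
process("that was *really great* work")
# => "that was <b>really great</b> work"
process("_one two three four_ is too long")
# => nil (more than three words inside the markers, so no substitution is made)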