How do you split an email on lines? - ruby

I am retrieving emails using the Fetcher plugin for Rails, and it is doing a fine job. However, when I try to split the body of the email on newlines, it appears to be only one really long line.
What is the best way (in Ruby) to split an email up into multiple lines?

Sounds like you need a word wrapping algorithm. Here is a short and clever way of word wrapping in Ruby that I found on the ruby-talk mailing list:
puts $<.read.gsub(/\t/," ").gsub(/.{1,50}(?:\s|\Z)/){($& +
5.chr).gsub(/\n\005/,"\n").gsub(/\005/,"\n")}
Here's a slightly prettier version wrapped in a method:
def wordwrap(str, columns=80)
  str.gsub(/\t/, " ").gsub(/.{1,#{ columns }}(?:\s|\Z)/) do
    ($& + 5.chr).gsub(/\n\005/, "\n").gsub(/\005/, "\n")
  end
end
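To get back to the original goal of working with the body line by line, the wrapped string can then be split on newlines. The following is a minimal sketch rather than a definitive recipe: it repeats the wordwrap method from above so that it runs by itself, and the body text and the 20-column width are made-up values for illustration.

```ruby
# Same method as above, repeated so this snippet is self-contained.
def wordwrap(str, columns = 80)
  str.gsub(/\t/, " ").gsub(/.{1,#{columns}}(?:\s|\Z)/) do
    ($& + 5.chr).gsub(/\n\005/, "\n").gsub(/\005/, "\n")
  end
end

# Hypothetical email body that arrived as one long line.
body = "The quick brown fox jumps over the lazy dog and keeps running through the field"

# Wrap first, then split; each chunk keeps its trailing space, so strip it.
lines = wordwrap(body, 20).split("\n").map(&:rstrip)
lines.each { |line| puts line }
```

Note that a single word longer than the column width will not be wrapped to that width by this pattern, so very long URLs and the like may need separate handling.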

Related

String parse using regex

I have a string which is a function call. I want to parse it and obtain the parameters:
"add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
It has a total of 6 parameters and is a mixture of urls, integers and decimals. I can't figure out the regex for the split method which I will be using. Please help!
This is what I have come up with - which is wrong.
/('(.*\/[0-9]*)',)|([0-9]*,)/
Treating the string like a CSV might work:
require 'csv'
str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
p CSV.parse(str[13..-2], :quote_char => "'").first
# => ["http://abc.com/page/1/", "This is the title, it is long", "39.677765", "-45.4343", "34454", "http://abc.com/images/image_1.jpg"]
Assuming all non-numeric parameters are enclosed in single quotes, as in your example
string.scan( /'.+?'|[-0-9.]+/ )
You really don't want to be parsing things this complex with a reg-ex; it just won't work in the long run. I'm not sure if you just want to parse this one string, or if there are lots of strings in this form which vary in exact contents. If you give a bit more info about your end goal, you might be able to get some more detailed help.
For parsing things this complex in the general case, you really want to perform proper tokenization (i.e. lexical analysis) of the string. In the past with Ruby, I've had good experiences doing this with Citrus. It's a nice gem for parsing complex tokens/languages like you're trying to do. You can find more about it here:
https://github.com/mjijackson/citrus
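Short of writing a full grammar in Citrus, a small hand-written tokenizer already avoids most of the pitfalls of splitting with a single regex. The sketch below uses StringScanner from Ruby's standard library. parse_call_args is a hypothetical helper name, and the code assumes single-quoted strings without escaped quotes and plain decimal numbers, as in the example string.

```ruby
require 'strscan'

# Hypothetical helper: tokenizes the argument list of a call such as
# name('quoted, string', 1.5, -2). Assumes single-quoted strings with no
# escaped quotes inside them, and plain decimal numbers.
def parse_call_args(call)
  scanner = StringScanner.new(call)
  scanner.skip(/\s*\w+\s*\(/) or raise ArgumentError, "expected name("
  args = []
  until scanner.skip(/\s*\)/)
    if scanner.scan(/\s*'([^']*)'/)
      args << scanner[1]            # quoted string, commas inside are kept
    elsif scanner.scan(/\s*(-?\d+\.\d+)/)
      args << scanner[1].to_f       # decimal number
    elsif scanner.scan(/\s*(-?\d+)/)
      args << scanner[1].to_i       # integer
    else
      raise ArgumentError, "unexpected input at position #{scanner.pos}"
    end
    scanner.skip(/\s*,/)            # optional separator before the next token
  end
  args
end

str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
args = parse_call_args(str)
p args
```

Unlike the scan and CSV approaches, this converts the numbers to Float and Integer and raises on input it does not understand instead of silently skipping it.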

Ruby Regex to capture everything between two strings (inclusive)

I'm trying to sanitize some HTML and just remove a single tag (and I'd really like to avoid using nokogiri, etc). So I've got the following string appearing I want to get rid of:
<div class="the_class">Some junk here that's different every time</div>
This appears exactly once in my string, and I'd like to find a way to remove it. I've tried coming up with a regex to capture it all but I can't find one that works.
I've tried /<div class="the_class">(.*)<\/div>/m and that works, but it'll also match up to and including any further </div> tags in the document, which I don't want.
Any ideas on how to approach this?
I believe you're looking for a non-greedy regex, like this:
/<div class="the_class">(.*?)<\/div>/m
Note the added ?. Now the capturing group will capture as little as possible (non-greedy) instead of as much as possible (greedy).
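As a rough illustration of the difference, here is a small sketch on a made-up HTML fragment. The html string and the remove_between helper are assumptions for the example, not part of the original question; the helper shows one way to handle arbitrary start and end strings by escaping them with Regexp.escape.

```ruby
# Made-up sample: the target div is followed by another, unrelated div.
html = '<p>keep</p><div class="the_class">junk</div><div>other</div>'

# Greedy: .* runs on to the LAST </div>, so the second div is lost as well.
greedy = html.sub(/<div class="the_class">.*<\/div>/m, '')

# Non-greedy: .*? stops at the FIRST </div> after the opening tag.
lazy = html.sub(/<div class="the_class">.*?<\/div>/m, '')

# Hypothetical helper for arbitrary start and end strings.
# Regexp.escape keeps characters like [ or . from being read as regex syntax.
def remove_between(text, start_str, end_str)
  pattern = /#{Regexp.escape(start_str)}.*?#{Regexp.escape(end_str)}/m
  text.sub(pattern, '')
end

puts greedy
puts lazy
puts remove_between('a [[x]] b [[y]] c', '[[', ']]')
```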
Because it adds another dependency and slows my work down. Makes things more complicated. Plus, this solution is applicable to more than just HTML tags. My start and end strings can be anything.
I used to think the same way until I got a job writing spiders and web-site analytics, then writing a big RSS-aggregation system -- A parser was the only way out of that madness. Without it the work would never have been finished.
Yes, regex are good and useful, but there are dragons waiting for you. For instance, this common string will cause problems:
'<div class="the_class"><div class="inner_div">foo</div></div>'
The regex /<div class="the_class">(.*?)<\/div>/m will return:
"<div class=\"the_class\"><div class=\"inner_div\">foo</div>"
This malformed, but renderable HTML:
<div class="the_class"><div class="inner_div">foo
is even worse:
'<div class="the_class"><div class="inner_div">foo'[/<div class="the_class">(.*?)<\/div>/m]
=> nil
Whereas, a parser can deal with both:
require 'nokogiri'
[
  '<div class="the_class"><div class="inner_div">foo</div></div>',
  '<div class="the_class"><div class="inner_div">foo'
].each do |html|
  doc = Nokogiri.HTML(html)
  puts doc.at('div.the_class').text
end
Outputs:
foo
foo
Yes, your start and end strings could be anything, but there are well-recognized tools for parsing HTML/XML, and as your task grows the weaknesses in using regex will become more apparent.
And, yes, it's possible to have a parser fail. I've had to process RSS feeds that were so badly malformed the parser blew up, but a bit of pre-processing fixed the problem.

How to pull the email address out of this string?

Here are two possible email string scenarios:
email = "Joe Schmoe <joe@example.com>"
email = "joe@example.com"
I always only want joe@example.com.
So what would the regex or method be that would account for both scenarios?
This passes your examples:
def find_email(string)
  string[/<([^>]*)>$/, 1] || string
end
find_email "Joe Schmoe <joe@example.com>" # => "joe@example.com"
find_email "joe@example.com" # => "joe@example.com"
If you know your email is always going to be in the < > then you can take a substring, using those as the starting and ending indexes.
If those are the only two formats, don't use a regex. Just use simple string parsing. If you find a "<>" pair, pull the email address out from between them; if you don't find those characters, treat the whole string as the email address.
Regexes are great when you need them, but for very simple patterns the cost of compiling and running a regex can be higher than that of simple string manipulation. Sticking to what is in the core of the language, rather than loading extra libraries, also tends to be faster.
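The plain string approach described above might look like the following sketch. extract_email is a hypothetical name, and the code only covers the two formats from the question; it makes no attempt to validate the address.

```ruby
# Hypothetical helper: returns the text between the first "<" and the
# following ">", or the whole (stripped) string when no such pair exists.
def extract_email(string)
  open_idx = string.index('<')
  close_idx = string.index('>', open_idx + 1) if open_idx
  return string.strip unless open_idx && close_idx

  string[(open_idx + 1)...close_idx].strip
end

puts extract_email('Joe Schmoe <joe@example.com>')
puts extract_email('joe@example.com')
```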
If you are willing to load an extra library, this has already been solved in the TMail gem:
http://lindsaar.net/2008/4/13/tip-5-cleaning-up-an-verifying-an-email-address-with-ruby-on-rails
TMail::Address.parse('Mikel A. <spam@lindsaar.net>').spec
=> "spam@lindsaar.net"

how to convert strings like "this is an example" to "this-is-an-example" under ruby

How do I convert strings like "this is an example" to "this-is-an-example" under ruby?
The simplest version:
"this is an example".tr(" ", "-")
#=> "this-is-an-example"
You could also do something like this, which is slightly more robust and easier to extend by updating the regular expression:
"this is an example".gsub(/\s+/, "-")
#=> "this-is-an-example"
The above will replace each run of whitespace (any combination of spaces, tabs, and newlines) with a single dash.
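The difference between the two versions only shows up on messier input. A small sketch, using a made-up string with a double space and a tab:

```ruby
# Made-up input: two spaces after "this" and a tab after "is".
messy = "this  is\tan example"

with_tr = messy.tr(" ", "-")        # one dash per space, tab left untouched
with_gsub = messy.gsub(/\s+/, "-")  # one dash per run of any whitespace

p with_tr
p with_gsub
```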
See the String class reference for more details about the methods that can be used to manipulate strings in Ruby.
If you are trying to generate a string that can be used in a URL, you should also consider stripping other non-alphanumeric characters (especially the ones that have special meaning in URLs), or replacing them with an alphanumeric equivalent (for example, as suggested by Rob Cameron in his answer).
If you are trying to make something that is a good URL slug, there are lots of ways to do it.
Generally, you want to remove everything that is not a letter or number, and then replace all whitespace characters with dashes.
So:
s = "this is an 'example'"
s = s.gsub(/\W+/, ' ').strip
s = s.gsub(/\s+/,'-')
At the end s will equal "this-is-an-example"
I used the source code from a ruby testing library called contest to get this particular way to do it.
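The same idea can be wrapped in a small method. This is only a sketch: slugify is a hypothetical name, it lowercases the input (which the steps above do not), and it treats everything outside ASCII letters and digits as a separator, so unlike \W it also replaces underscores and drops accented characters.

```ruby
# Hypothetical helper: ASCII-only slug. Runs of anything that is not
# a-z or 0-9 become a single dash; leading and trailing dashes are removed.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

puts slugify("this is an 'example'")
puts slugify("Hello, world!")
```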
If you're using Rails take a look at parameterize(), it does exactly what you're looking for:
http://api.rubyonrails.org/classes/ActiveSupport/CoreExtensions/String/Inflections.html#M001367
foo = "Hello, world!"
foo.parameterize # => "hello-world"

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is that it is not outputting the website links.
Those links actually have class=l not class="l". By the way, to figure this out I added some logging to the method so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the right string you actually wanted and changed the regex accordingly. Debugging skills are handy.
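require "open-uri"
url = "http://www.google.com/search?q=ruby"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
source = URI.open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
  # scan returns one array of captures per match, so take the first capture
  puts "- #{link.first}"
end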
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters of some sort that you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the things right before it, and the ? means that the .+ should try to match as short a string as possible.
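To see the effect of the ? without hitting the network, the pattern can be tried on a fixed string. The sample markup below is made up for illustration and only mimics the class=l links described above.

```ruby
# Made-up markup with two links on one line.
sample = '<a href="http://www.test.com/" class=l>Test</a> ' \
         '<a href="http://example.org/" class=l>Example</a>'

# Lazy quantifiers: each match stops as early as it can, so both links are found.
lazy_links = sample.scan(/<a.+?href="(.+?)".+?class=l/).flatten

# Greedy quantifiers: the first .+ swallows as much as possible, so the two
# links collapse into one match and only the last href is captured.
greedy_links = sample.scan(/<a.+href="(.+)".+class=l/).flatten

p lazy_links
p greedy_links
```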
To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only recognize the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)
