I'm using this code to list email addresses from an HTML page.
require 'nokogiri'
selector = "//a[starts-with(@href, \"mailto:\")]/@href"
doc = Nokogiri::HTML.parse File.read 'in.rb'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
This is sample code I'm parsing:
<a href="mailto:joe@example.com?subject=My Business Is Dying">
But I'm getting more than just the email address. I'm getting this in my results:
joe@example.com?subject=My Business Is Dying
How do I drop off everything after the question mark so it's only the email address?
You could always chop off anything after the ? character:
addresses.map! do |address|
address.sub(/\?.*/, '')
end
I'd probably use one of these two:
str = 'joe@example.com?subject=My Business Is Dying'
str.split('?').first # => "joe@example.com"
str[/^[^?]+/] # => "joe@example.com"
The second is a simple regular expression embedded in String's [] (slice) method. The pattern basically says "start at the beginning and grab everything up until a question mark."
They're equivalent as far as speed goes. I'd probably use the first because it's easier to read.
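Both steps (dropping the mailto: prefix and dropping the query string) can be folded into one small helper. This is only a sketch: address_from_mailto is a hypothetical name, and the second sample href is made up for illustration.

```ruby
# Hypothetical helper (not from the answers above): returns the bare
# address from a mailto: href, dropping the scheme and any query string.
def address_from_mailto(href)
  href.sub(/\Amailto:/i, '').split('?', 2).first.to_s
end

hrefs = [
  'mailto:joe@example.com?subject=My Business Is Dying',
  'mailto:ann@example.com' # made-up second sample
]
addresses = hrefs.map { |href| address_from_mailto(href) }
puts addresses
```

In the original script, the same helper could be applied to n.value inside the collect block instead of slicing with [7..-1].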
In controller:
str= "Employee <b><a href=http://xyz.localhost.in:3000/admin/company>Uday Das</a></b> has applied for leave."
I want to remove the anchor tag from the string above, so that it becomes Employee <b>Uday Das</b> has applied for leave.
I used this code:
ActionView::Base.full_sanitizer.sanitize(str)
But it removes all the HTML tags from the string, so I end up with Employee Uday Das has applied for leave.
NOTE: The strings are dynamic. The anchor tag's position is not fixed; it could be anywhere in the string.
You can use nokogiri gem.
Something like:
require 'nokogiri'
doc = Nokogiri::HTML str
node = doc.at("a")
node.replace(node.text)
puts doc.inner_html
# <html><body><p>Employee <b>Uday Das</b> has applied for leave.</p></body></html>
or to match your exact output:
puts doc.at("p").inner_html
# Employee <b>Uday Das</b> has applied for leave.
I got a simple solution:
include ActionView::Helpers::SanitizeHelper
sanitize(str, :tags=>["b"])
For links, you can use strip_links method from ActionView::Helpers::SanitizeHelper
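strip_links('<a href="http://www.rubyonrails.org">Ruby on Rails</a>')
# => Ruby on Rails
strip_links('Please e-mail me at <a href="mailto:me@email.com">me@email.com</a>.')
# => Please e-mail me at me@email.com.
strip_links('Blog: <a href="http://www.myblog.com/" class="nav" target="_blank">Visit</a>.')
# => Blog: Visit.
strip_links('<<a href="https://example.org">malformed & link</a>')
# => &lt;malformed &amp; link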
I need to replace field_to_replace from
...<div>\r\n<span field=\"field_to_replace\">\r\n<div>....
There are multiple occurrences of field_to_replace in the string. I need to replace only this occurrence using the tag before and after it.
Don't use regular expressions to try to search or replace inside HTML or XML unless you are guaranteed that the source layout won't change. It's really easy to use a parser to make the changes, and they'll easily handle changes to the source.
This would replace all occurrences of the string in the HTML:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field='field_to_replace'><div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div></div></span></div>"
doc.search('div span[@field]').each do |span|
span['field'] = 'foo'
end
doc.to_html # => "<div><span field=\"foo\"><div></div></span></div>"
If you want to replace just the first occurrence, use at instead of search:
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field=\"field_to_replace\"><div><span field='field_to_replace'></span></div></span></div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div><span field=\"field_to_replace\"></span></div></span></div>"
doc.at('div span[@field]')['field'] = 'foo'
doc.to_html # => "<div><span field=\"foo\"><div><span field=\"field_to_replace\"></span></div></span></div>"
By defining the CSS selector you can identify the node quickly and easily. And, if you need even more power then XPath can be used instead of CSS.
The simple way would be:
str = "...<div>\r\n<span field=\"field_to_replace\">\r\n<span field=\"field_to_replace\">\r\n<div>...."
str.split("field_to_replace").join("new_field")
Let us know if you need something more complex.
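If only the occurrence between specific tags should change, a plain-string approach can capture the surrounding markup and put it back around the new value. The following is a sketch under the assumption that the markup before and after the attribute is exactly as shown in the question; the extra leading <p> span is made up to show that other occurrences are left alone. As noted above, this is brittle if the source layout changes.

```ruby
str = "<p><span field=\"field_to_replace\"></span></p>" \
      "<div>\r\n<span field=\"field_to_replace\">\r\n<div>"

# Capture the surrounding markup so that only the span sitting between
# "<div>\r\n" and "\r\n<div>" is touched; sub stops after the first match.
pattern = /(<div>\r\n<span field=")field_to_replace(">\r\n<div>)/
result = str.sub(pattern) { "#{$1}new_field#{$2}" }

puts result.inspect
```

Using sub rather than gsub limits the change to a single match even if the same context appears again later in the string.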
How do I encode or 'escape' the URL before I use OpenURI to open(url)?
We're using OpenURI to open a remote url and return the xml:
getresult = open(url).read
The problem is the URL contains some user-input text that contains spaces and other characters, including "+", "&", "?", etc. potentially, so we need to safely escape the URL. I saw lots of examples when using Net::HTTP, but have not found any for OpenURI.
We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.
Don't use URI.escape: it has been obsolete since Ruby 1.9.2 and was removed in Ruby 3.0.
Rails' Active Support adds Hash#to_query:
{foo: 'asd asdf', bar: '"<#$dfs'}.to_query
# => "bar=%22%3C%23%24dfs&foo=asd+asdf"
Also, as you can see it tries to order query parameters always the same way, which is good for HTTP caching.
Ruby Standard Library to the rescue:
require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read
See more at the docs for the URI::Escape module. It also has a method to do the inverse (unescape)
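On Ruby 3.0 and later URI.escape no longer exists, so the snippet above will not run there. A rough equivalent using only the standard library might look like the sketch below; the sample text and the example.com URL are made up. Note that these methods use form encoding (spaces become +), which suits query strings rather than path segments.

```ruby
require 'uri'

user_text = 'rock & roll + "jazz"?' # made-up user input

# Escape a single value, and reverse it again.
escaped  = URI.encode_www_form_component(user_text)
original = URI.decode_www_form_component(escaped)

# Build a whole query string from key/value pairs in one step.
query = URI.encode_www_form(q: user_text, page: 1)
url   = "http://example.com/search?#{query}" # hypothetical endpoint

puts escaped
puts original
puts url
```

The resulting url can then be passed to OpenURI as before.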
The main thing you have to consider is that you have to escape the keys and values separately before you compose the full URL.
All the methods which get the full URL and try to escape it afterwards are broken, because they cannot tell whether any & or = character was supposed to be a separator, or maybe a part of the value (or part of the key).
The CGI library seems to do a good job, except for the space character, which was traditionally encoded as +, and nowadays should be encoded as %20. But this is an easy fix.
Please, consider the following:
require 'cgi'
def encode_component(s)
# The space-encoding is a problem:
CGI.escape(s).gsub('+','%20')
end
def url_with_params(path, args = {})
return path if args.empty?
path + "?" + args.map do |k,v|
"#{encode_component(k.to_s)}=#{encode_component(v.to_s)}"
end.join("&")
end
def params_from_url(url)
path,query = url.split('?',2)
return [path,{}] unless query
q = query.split('&').inject({}) do |memo,p|
k,v = p.split('=',2)
memo[CGI.unescape(k)] = CGI.unescape(v)
memo
end
return [path, q]
end
u = url_with_params( "http://example.com",
"x[1]" => "& ?=/",
"2+2=4" => "true" )
# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"
params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
Ruby has the built-in URI library, and the Addressable gem, in particular Addressable::URI
I prefer Addressable::URI. It's very full featured and handles the encoding for you when you use the query_values= method.
I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:
http://osdir.com/ml/ruby-core/2010-06/msg00324.html
http://osdir.com/ml/lang-ruby-core/2009-06/msg00350.html
http://osdir.com/ml/ruby-core/2011-06/msg00748.html
I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and return the text following that until the next forward slash '/'.
So, given:
<a href="/abc/12345/...">One</a>
<a href="/abc/67890/...">Two</a>
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need regex, but you can use it. Here are two examples, one with regex and one without, using Nokogiri, which should be compatible with Hpricot for your use, and which uses CSS accessors:
require 'nokogiri'
html = %q[
<a href="/abc/12345/...">One</a>
<a href="/abc/67890/...">Two</a>
]
doc = Nokogiri::HTML(html)
doc.css('a[@href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[@href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = '<a href="/abc/12345/...">One</a>'
s =~ /abc\/([^\/]*)/
$1 # => "12345"
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").each do |a|
  puts a["href"].split("/")[2] # index 2, because the string starts with '/'
end
I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups captured by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
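The difference is easy to see on a small input. This sketch uses a deliberately simplified pattern and made-up URLs, not the regex from the question:

```ruby
content = "see http://a.example.com and https://b.example.org/x"

# With capturing groups, scan returns one array of groups per match.
with_groups = content.scan(/(https?):\/\/(\S+)/)

# With only non-capturing groups, scan returns the whole matched strings.
whole_matches = content.scan(/(?:https?):\/\/\S+/)

p with_groups
p whole_matches
```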
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
just for your interest:
Ruby has a URI module, which has a regex implemented to do such things:
require "uri"
# html_string is the text to search
uris_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
urls = []
html_string.scan(URI.regexp(uris_you_want_to_grab)) do |*matches|
  urls << $& # $& holds the full text of the current match
end
For more information visit the Ruby Ref: URI
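URI.regexp and URI.extract are marked obsolete in newer Ruby versions. As far as I know the same behaviour is still available on an explicit URI::RFC2396_Parser instance, so a sketch along these lines should work, reusing the sample content from the question:

```ruby
require 'uri'

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

# Use an explicit RFC 2396 parser instead of the obsolete module-level helpers.
parser = URI::RFC2396_Parser.new
urls = parser.extract(content, %w[http https])

puts urls
```

Restricting the schemes to http and https keeps other scheme-like tokens in the text from being picked up.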
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].