I have this code to remove URIs using these schemes:
htmldoc.gsub(/#{URI::regexp(['http', 'https', 'ftp', 'mailto'])}/, '')
However, it won't detect a capitalized URI like HTTP or Http unless I add them to the array.
I tried adding the case-insensitive flag i to the regex, but it didn't work.
Any idea how I could achieve this?
URI::regexp calls the default parser's make_regexp, which in turn passes the given arguments to Regexp::union. According to its docs (emphasis mine):
The patterns can be Regexp objects, in which case their options will be preserved, or Strings.
Applied to your problem:
pattern = URI::regexp([/http/i, /https/i, /ftp/i, /mailto/i])
htmldoc = <<-HTML
<html>
<body>
here
here
</body>
</html>
HTML
puts htmldoc.gsub(pattern, '')
Output:
<html>
<body>
here
here
</body>
</html>
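To see the two behaviours in isolation, here is a minimal sketch that uses only Regexp from the standard library. The variable names are illustrative and not part of the original answer. It shows that Regexp.union keeps the options of Regexp arguments, and why an outer i flag has no effect on a Regexp that was interpolated into another one.

```ruby
# String patterns are matched as case-sensitive literals.
from_strings = Regexp.union('http', 'ftp')
# Regexp patterns keep their own options, including /i.
from_regexps = Regexp.union(/http/i, /ftp/i)

puts from_strings.match?('HTTP')  # false
puts from_regexps.match?('HTTP')  # true

# Interpolating a Regexp embeds it with its options spelled out, so the
# outer /i flag is switched off again inside the interpolated group.
inner = /http/
outer = /#{inner}/i
puts inner.to_s                   # (?-mix:http)
puts outer.match?('HTTP')         # false
```

The same reasoning applies to the interpolated URI::regexp result in the question: the flag has to be set on the scheme patterns themselves.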
I am trying to add a trademark symbol to all the instances of "Imagination Playground" in my HTML document. However I end up with something like this:
&lt;i class="fa fa-trademark"&gt;&lt;/i&gt;
It seems like the symbol I am using is converted to HTML characters. How can I escape that?
This is my original Ruby code:
body = "<p>Whether you want to build a playground, make play a priority in your community, or learn more about Imagination Playground , we've got webinars for you in March!</p>
<p>As always, all our webinars are FREE. All you need to participate is a phone and a computer with an Internet connection.</p>"
new_body = Nokogiri::HTML(body)
new_body.encoding = 'UTF-8'
new_body.css('p','a').each{ |p|
  p.content = p.content.gsub(/Imagination Playground\s/,'Imagination Playground<i class="fa fa-trademark"></i>')
}
puts new_body
And this is what I get:
<p>Whether you want to build a playground, make play a priority in your community, or learn more about Imagination Playground&lt;i class="fa fa-trademark"&gt;&lt;/i&gt;, we've got webinars for you in March!</p>
<p>As always, all our webinars are FREE. All you need to participate is a phone and a computer with an Internet connection.</p>
How can I replace that HTML paragraph and escape ampersand and special characters?
Here's how I'd do it:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>Whether you want to build a playground, make play a priority in your community, or learn more about Imagination Playground , we've got webinars for you in March!</p>
<p>As always, all our webinars are FREE. All you need to participate is a phone and a computer with an Internet connection.</p>
EOT
doc.encoding = 'UTF-8'
doc.css('p').each do |p|
p.children = p.content.gsub(/Imagination Playground\s/, 'Imagination Playground<i class="fa fa-trademark"></i>')
end
puts doc
Which results in:
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>Whether you want to build a playground, make play a priority in your community, or learn more about Imagination Playground<i class="fa fa-trademark"></i>, we've got webinars for you in March!</p>
# >> <p>As always, all our webinars are FREE. All you need to participate is a phone and a computer with an Internet connection.</p>
# >> </body></html>
Nokogiri is pretty smart. When it sees children=, it looks to see whether it's receiving a string. If so, it parses that string, converts it into a Node, then replaces the existing children with the new node. This is a big difference from using content=, which Nokogiri knows should be text, and so it encodes the embedded tags into &lt;, etc. This is covered in the documentation.
For children=:
Set the inner html for this Node to node_or_tags. node_or_tags can be a Nokogiri::XML::Node, a Nokogiri::XML::DocumentFragment, or a string containing markup.
For content=:
Set the Node's content to a Text node containing string. The string gets XML escaped, not interpreted as markup.
This would not work if I want to preserve the HTML tags that are inside the paragraph. Try doing that for <p>fsome test and then <b>bold</b></p>
You are changing the requirements. Don't do that. Be specific about your needs so we can answer the real question once.
A small alteration is needed to take the contents of the desired tag. Use children.to_html to get the HTML string of the embedded nodes then gsub it and use its result:
require 'nokogiri'
doc = Nokogiri::HTML('<p>Imagination Playground<b>foo</b></p>')
puts doc.to_html
Which looks like this to start:
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Imagination Playground<b>foo</b></p></body></html>
Modify the DOM:
doc.search('p').each do |p|
p.children = p.children.to_html.gsub(/Imagination Playground\s?/, 'Imagination Playground<i class="fa fa-trademark"></i>')
end
puts doc
Which now looks like:
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Imagination Playground<i class="fa fa-trademark"></i><b>foo</b></p></body></html>
Notice I'm using search instead of css. Use the generic method instead of the more specific. It makes it easier to switch to XPaths if needed.
Also, I'm using a little more intelligent pattern in the gsub to conditionally grab a single trailing whitespace if it's available. It's not essential to do that with HTML because browsers gobble blanks, but it would be the right way to do it if you were dealing with regular text documents or preformatted text.
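The effect of the optional whitespace can be checked on plain strings, without Nokogiri. This is only a sketch; the sample strings and variable names are made up for illustration.

```ruby
replacement = 'Imagination Playground<i class="fa fa-trademark"></i>'

with_space    = 'about Imagination Playground , we have webinars'
without_space = 'Imagination Playground<b>foo</b>'

strict   = /Imagination Playground\s/   # requires one trailing whitespace
tolerant = /Imagination Playground\s?/  # takes one trailing whitespace if present

puts with_space.gsub(strict, replacement)
puts without_space.gsub(strict, replacement)    # unchanged: nothing for \s to match
puts without_space.gsub(tolerant, replacement)
```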
And, just for more detail about what Nokogiri is seeing:
doc.search('p').first
# => #(Element:0x3fd222462204 {
# name = "p",
# children = [
# #(Text "Imagination Playground"),
# #(Element:0x3fd2224608f0 { name = "b", children = [ #(Text "foo")] })]
# })
doc.search('p').first.children
# => [#<Nokogiri::XML::Text:0x3fd222461688 "Imagination Playground">, #<Nokogiri::XML::Element:0x3fd2224608f0 name="b" children=[#<Nokogiri::XML::Text:0x3fd22245fe64 "foo">]>]
How can I get the element at index 2?
For example, in the following HTML I want to display the third element, i.e. a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at, which returns the first Node that matches the selector. //div[2] is an XPath selector that matches a <div> that is the second <div> child of its parent, which here is the second <div> found. search could be used instead of at, but it returns a NodeSet, which is like an array, and would mean I'd need to extract that particular node.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag, means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way is that it will be slower and needlessly memory intensive on big documents, since search finds all occurrences of the nodes that match the selector in the document, which are then thrown away after selecting only one. search, css and xpath all behave that way, so if you only need the first matching node, use at or its at_css and at_xpath equivalents and provide a sufficiently definitive selector to find just the tag you want.
'body:nth-child(2)' doesn't work because you're not using it right, according to ":nth-child()" and how I understand it works. nth-child doesn't select the nth child of the tag supplied; it matches the supplied tag only when that tag is itself the nth child of its parent. So, you're asking for a body that is the second tag under its "html" parent, not for anything inside body. At best that matches body itself, because a correctly formed HTML document would be:
<html>
<head></head>
<body></body>
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use div:nth-child(3), which says, "find the div that is the third child of its parent", which is "body" here, and results in the second div tag.
Back to how Nokogiri can be told to parse a document. Meditate on the difference between these:
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
If you can modify the HTML, add ids and classes to easily target what you are looking for (also add the body tag).
If you cannot modify the HTML, keep your selector simple and access the second element of the array:
html_doc.css('div')[1]
I've got something like that in HTML coming from server:
<html ...>
<head ...>
....
<link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
<link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
...
</head>
<body>
...
</body>
</html>
If b holds the browser object navigated to the page I need to look through, I'm able to find rel="canonical" with b.html.include? statement, but how could I retrieve the entire line where this substring was found? And I also need the next (not empty) one.
You can use a css-locator (or xpath) to get link elements.
The following would return the html (which would be the line) for the link element that has the rel attribute value of "canonical":
b.element(:css => 'link[rel="canonical"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
I am not sure what you mean by "I also need the next (not empty) one.". If you mean that you want the one with rel attribute value of "next", you can similarly do:
b.element(:css => 'link[rel="next"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
You could use String#each_line to iterate through each line in b.html and check for rel=:
b.goto('http://www.iana.org/domains/special')
b.html.each_line {|line| puts line if line.include? "rel="}
That should return all strings including rel= (although it could return lines that you don't want, such as <a> tags with rel attributes).
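If the goal is the matching line plus the next non-empty one, the same line-based idea could be sketched as follows. The html string here is a stand-in for b.html, built from the markup in the question with a blank line added on purpose, and the variable names are hypothetical.

```ruby
# Stand-in for b.html; in the real script this would come from the browser.
html = <<~HTML
  <head>
  <link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />

  <link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
  </head>
HTML

lines = html.each_line.map(&:strip)

# Position of the first line that mentions rel="canonical" (nil if absent).
index = lines.index { |line| line.include?('rel="canonical"') }

canonical = lines[index]
# First non-empty line after the match.
following = lines.drop(index + 1).find { |line| !line.empty? }

puts canonical
puts following
```

This depends on how the server happens to break lines, so the locator-based approach is likely to be more robust.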
Alternately, you could use nokogiri to parse the HTML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.iana.org/domains/special"))
nodes = doc.css('link')
nodes.each { |node| puts node}
My task is to get the HTML structure of the document without data. From:
<html>
<head>
<title>Hello!</title>
</head>
<body id="uniq">
<h1>Hello World!</h1>
</body>
</html>
I want to get:
<html>
<head>
<title></title>
</head>
<body id="uniq">
<h1></h1>
</body>
</html>
There are a number of ways to extract data with Nokogiri, but I couldn't find a way to perform the reverse task.
UPDATE:
The solution found is the combination of two answers I received:
doc = Nokogiri::HTML(open("test.html"))
doc.at_css("html").traverse do |node|
if node.text?
node.remove
end
end
puts doc
The output is exactly the one I want.
It sounds like you want to remove all the text nodes. You can do this like so:
doc.xpath('//text()').remove
puts doc
Traverse the document. For each node, delete what you don't want. Then write out the document.
Remember that Nokogiri can change the document.
"Nokogiri: How to select nodes by matching text?" can do this via XPath; however, I am looking for a way to use a CSS selector that matches the text of an element.
PyQuery and PHPQuery can do this. Isn't there a jQuery API lib for Ruby?
Nokogiri (now) implements jQuery selectors, making it possible to search the text of a node:
For instance:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
# => "bar"
Cannot be done with pure CSS, you'll have to mix it with Ruby code
doc = Nokogiri::HTML("<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>")
p doc.css('li').select{|li|li.text =~ /Apple/}