Get HTML structure using Nokogiri - ruby

My task is to get the HTML structure of the document without data. From:
<html>
<head>
<title>Hello!</title>
</head>
<body id="uniq">
<h1>Hello World!</h1>
</body>
</html>
I want to get:
<html>
<head>
<title></title>
</head>
<body id="uniq">
<h1></h1>
</body>
</html>
There are a number of ways to extract data with Nokogiri, but I couldn't find a way to perform the reverse task.
UPDATE:
The solution found is the combination of two answers I received:
require 'nokogiri'

doc = Nokogiri::HTML(open("test.html"))
# Walk every node under <html> and drop the text nodes
doc.at_css("html").traverse do |node|
  node.remove if node.text?
end
puts doc
The output is exactly the one I want.

It sounds like you want to remove all the text nodes. You can do this like so:
doc.xpath('//text()').remove
puts doc

Traverse the document. For each node, delete what you don't want. Then write out the document.
Remember that Nokogiri can change the document in place.

Related

Get element in particular index nokogiri

How can I get the element at index 2?
For example, in the following HTML I want to display the third element, i.e. a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at, which returns the first Node that matches the selector. //div[2] is an XPath selector that matches a <div> that is the second <div> child of its parent. search could be used instead of at, but it returns a NodeSet, which is like an array, and would mean I'd need to extract that particular node.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way is that it will be slower and needlessly memory-intensive on big documents, since search finds all occurrences of the nodes that match the selector in the document, and they are then thrown away after selecting only one. search, css and xpath all behave that way, so if you only need the first matching node, use at or its at_css and at_xpath equivalents and provide a sufficiently specific selector to find just the tag you want.
'body:nth-child(2)' doesn't work because of how ":nth-child()" matches. nth-child doesn't find the nth occurrence of a tag; it matches an element of the given type only when that element is the nth child of its parent, counting from 1. So you're asking for a body tag that is the second child of its "html" parent. At best that returns the body itself, never a div inside it, because a correctly formed HTML document would be:
<html>
<head></head>
<body></body>
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use div:nth-child(3), which says "find a div that is the third child of its parent". The parent here is "body", and the result is the second div tag.
Back to how Nokogiri can be told to parse a document. Meditate on the difference between these:
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
If you can modify the HTML, add ids and classes so you can easily target what you are looking for (and also add the body tag).
If you cannot modify the HTML, keep your selector simple and access the second element of the array.
html_doc.css('div')[1]

Replacing part of text string with a page include in Classic ASP

Not 100% sure if this is possible, but hoping there is a workaround. Several hours of searching bring nothing up. I have a text string written to the page from a Db table. If it contains a specific string, I would like to add a page include - example below does write:
<!--#include file="members.asp"-->
into the text, but does not pull the included file content in.
<%=Replace(myQuery("Text"), "123456", "%><!--#include file="mypage.asp"--><% ")%>
The client wants it in the page rather than at the top or bottom of the output, which would be so easy (and we already do that). The include has to go in at a specific point in the text.
I would appreciate any help, even if it is to confirm that it is not possible to do this.
Here is the main page:
<!DOCTYPE html>
<html>
<head>
<title>Test</title>
</head>
<body onload="document.getElementById('placeholder').innerText = document.getElementById('alwaysfillme').innerText">
<p><%=Replace("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "J", "<span id=""placeholder""></span>")%></p>
<span id="alwaysfillme" style="display:none;"><!--#include file="mypage.asp"--></span>
</body>
</html>
And here is what I stuck in "mypage.asp":
<% response.Write("--123--") %>
When a J is in the text, it displays:
ABCDEFGHI--123--KLMNOPQRSTUVWXYZ
When no J is in the text, it displays:
ABCDEFGHIKLMNOPQRSTUVWXYZ
<%=Replace(myQuery("Text"), "123456", ""&Server.Execute("mypage.asp")&"")%>

xpath: how to locate a node that contains more than 1 of another specific node

Using xpath, I want to return all section tags that contain more than one title tag. I've tried
count(concept/conbody/section:child:title>1)
and that didn't return the results. I want to run this xpath across many files to locate those concepts that have a section containing more than one title.
<concept>
<title>Topic Title</title>
<shortdesc>Short description text.</shortdesc>
<conbody>
<section>
<title>Section Title</title>
<p>paragraph text.</p>
</section>
<section>
<title>Section Title</title>
<p>paragraph text.</p>
<title>Section Title</title>
<p>paragraph text.</p>
</section>
</conbody>
</concept>
Depending on how fixed the ancestors of section are, you may use:
concept/conbody/section[count(title) >1]
or:
//section[count(title) >1]
Alternatively, query for sections that have a second title element, which saves retrieving all the titles just to count them:
concept/conbody/section[title[2]]
This should work:
concept/conbody/section[count(title)>1]

watir-webdriver: how to retrieve entire line from HTML for which I found substring in it?

I've got something like this in the HTML coming from the server:
<html ...>
<head ...>
....
<link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
<link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
...
</head>
<body>
...
</body>
</html>
If b holds the browser object navigated to the page I need to look through, I'm able to find rel="canonical" with a b.html.include? check, but how can I retrieve the entire line where this substring was found? I also need the next (non-empty) one.
You can use a css-locator (or xpath) to get link elements.
The following would return the html (which would be the line) for the link element that has the rel attribute value of "canonical":
b.element(:css => 'link[rel="canonical"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
I am not sure what you mean by "I also need the next (not empty) one". If you mean that you want the one with rel attribute value of "next", you can similarly do:
b.element(:css => 'link[rel="next"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
You could use String#each_line to iterate through each line in b.html and check for rel=:
b.goto('http://www.iana.org/domains/special')
b.html.each_line {|line| puts line if line.include? "rel="}
That should print all lines that include rel= (although it could print lines that you don't want, such as <a> tags with rel attributes).
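The same line-by-line idea can also pick up the line after the match, which the question asks for. A sketch in plain Ruby, assuming the page source is already in a string; the page_source sample below is hypothetical and stands in for b.html:

```ruby
# Hypothetical stand-in for b.html, with a blank line between the two links.
page_source = <<~HTML
  <head>
  <link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />

  <link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
  </head>
HTML

# Split into lines and trim surrounding whitespace from each one.
lines = page_source.each_line.map(&:strip)

# Position of the first line containing the substring, or nil if absent.
index = lines.index { |line| line.include?('rel="canonical"') }

canonical_line = index && lines[index]
# First non-empty line after the match, or nil if there is none.
next_line = index && lines.drop(index + 1).find { |line| !line.empty? }

puts canonical_line
puts next_line
```

This depends on each tag sitting on its own line in the served HTML, which is not guaranteed, so the element locators above are the more robust choice.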
Alternately, you could use nokogiri to parse the HTML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.iana.org/domains/special"))
nodes = doc.css('link')
nodes.each { |node| puts node}

Inserting an element in local HTML file

I am trying to write a Ruby script that would read a local HTML file, and insert some more HTML (basically a string) into it after a certain #divid.
I am kinda noob so please don't hesitate to put in some code here.
Thanks
I was able to do this with the following:
doc = Nokogiri::HTML(open('file.html'))
data = "<div>something</div>"
doc.children.css("#divid").first.add_next_sibling(data)
And then (over)write the file with same data...
File.open("file.html", 'w') {|f| f.write(doc.to_html) }
This is a slightly more correct way to do it:
html = '<html><body><div id="certaindivid">blah</div></body></html>'
doc = Nokogiri::HTML(html)
doc.at_css('div#certaindivid').add_next_sibling('<div>junk goes here</div>')
print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div id="certaindivid">blah</div>
<div>junk goes here</div>
</body></html>
Notice the use of .at_css(), which finds the first occurrence of the target node and returns it, avoiding getting a nodeset back, and relieving you of the need to grab the .first() node.
