How do I replace tags defining a node? - ruby

We're trying to move from a rather small bug-tracking system to Redmine. There's no ready-made migration script for our old system, so we want to write one ourselves.
I suggested using Nokogiri to convert some of the formatting over to the new format (Textile), but I ran into problems.
This is from the DB field in our old system's DB:
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
This needs to be translated into Textile, and it would look like this:
* list item 1
* list item 2
Now, starting to parse using Nokogiri, I'm here:
def self.handle_ul(page)
  uls = page.css("ul")
  uls.each { |ul|
    lis = ul.css("li")
    lis.each { |li|
      li.inner_html = "*" << li.text << "\n"
    }
  }
end
This works like a charm. However, I need to do two replacements:
<li>
</li>
tags need to be removed from the <li> object, and:
<ul>
</ul>
tags need to be removed from the <ul> object. However, I can't seem to find the actual tags in the object representing the node; inner_html returns only the HTML between the tags I'm looking for:
ul.inner_html
Results in:
<li>list item 1</li>
<li>list item 2</li>
Where can I find the tags I need to replace? I thought about using parent and re-associating the child <li> tags with parent.parent, but that would place them at the end of the grandparent.
Can I somehow access the whole HTML representation of an object, without stripping its defining tags out, so that I can replace them?
EDIT:
As requested, here is a mockup of an old DB entry and the style it should have in Textile.
Before transformation:
Fixed for rev. 1.7.92.
<h4>Problems:</h4>
<ul>
<li>fixed.</li>
<li>fixed. New minimum 270x270</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed. Column types list is growing horizontally now.</li>
</ul>
After transformation:
Fixed for rev. 1.7.92.
h4.Problems:
* fixed.
* fixed. New minimum 270x270
* fixed.
* fixed.
* fixed.
* fixed. Column types list is growing horizontally now.
EDIT 2:
I tried to overwrite parts of the to_s method of the Nokogiri elements:
li.to_s["<li>"]=""
but that doesn't seem to be a valid lvalue (not that there is an error, it just doesn't do anything).

Here's the basis for such a transform:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
EOT
puts doc.to_html
doc.search('ul').each do |ul|
  ul.search('li').each do |li|
    li.replace("* #{ li.text.strip }")
  end
  ul.replace(ul.text)
end
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><ul>
<li>list item 1</li>
<li>list item 2</li>
</ul></body></html>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>* list item 1
* list item 2
</body></html>
I didn't intend, or attempt, to keep the first "item" from having a leading carriage return or line feed; that's left as an exercise for the reader. Nor did I try to handle the <h4> tags or similar substitutions, but from this code you should be able to figure out how to do it.
Also, I'm using Nokogiri::HTML to parse the HTML, which adds the DOCTYPE header and the <html> and <body> tags to mimic a full HTML document. That could be avoided by using Nokogiri::HTML::DocumentFragment.parse instead, but it wouldn't really make a difference in the output.
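For what it's worth, here's a rough sketch of that fragment-based variation, with a pass added for <h4> headers (the "h4. " prefix is the Textile heading syntax Redmine expects). Treat it as a starting point; whitespace handling is deliberately simplistic:
require 'nokogiri'

frag = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<h4>Problems:</h4>
<ul>
<li>fixed.</li>
<li>fixed. New minimum 270x270</li>
</ul>
EOT

# Replace each node with its Textile equivalent.
frag.search('h4').each { |h4| h4.replace("h4. #{ h4.text.strip }\n") }
frag.search('ul').each do |ul|
  ul.search('li').each { |li| li.replace("* #{ li.text.strip }") }
  ul.replace(ul.text)
end

puts frag.to_html
# => roughly "h4. Problems:" followed by the "* fixed. ..." lines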

You may want to look at ClothRed, which is an HTML to Textile converter in Ruby. It hasn't been updated in a while, but it's simple and may be a good starting point for your own converter.
If you really want to use Nokogiri, you're writing a filter, so you may want to use the SAX interface.
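To give an idea of what that could look like, here's a very rough sketch of a SAX-based filter that emits Textile markers as tags open and close. It ignores nesting, attributes, and any tag it doesn't know about, so treat it as a starting point rather than a working converter:
require 'nokogiri'

class TextileFilter < Nokogiri::XML::SAX::Document
  attr_reader :out

  def initialize
    @out = ''
    super
  end

  def start_element(name, attrs = [])
    case name
    when 'li' then @out << '* '
    when 'h4' then @out << 'h4. '
    end
  end

  def end_element(name)
    @out << "\n" if %w[li h4 ul].include?(name)
  end

  def characters(text)
    @out << text.strip
  end
end

filter = TextileFilter.new
Nokogiri::HTML::SAX::Parser.new(filter).parse('<h4>Problems:</h4><ul><li>fixed.</li></ul>')
puts filter.out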

You may want to try McBean (https://github.com/flavorjones/mcbean) [caveat: I'm the author of the gem, and it hasn't been updated in a while].
It's similar to ClothRed in spirit, but uses Nokogiri under the hood and actually transforms the document structure into output text. It supports a substantial subset of Textile, and in fact I've used it successfully to convert wiki pages between wiki systems, as you're trying to do.
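If I remember the README correctly, usage is roughly along these lines; double-check the gem's docs, since the exact API here is from memory:
require 'mcbean'

# From memory: McBean.fragment wraps an HTML snippet and exposes
# converters such as to_textile / to_markdown. Verify against the README.
puts McBean.fragment('<h4>Problems:</h4><ul><li>fixed.</li></ul>').to_textile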

If anyone interested finds this later, another alternative is Pandoc. I've just done my first tests; it seems almost sufficient, and it can handle many more formats.
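For reference, here's a small sketch of driving Pandoc from Ruby, assuming the pandoc binary is on your PATH; -f html and -t textile are its HTML reader and Textile writer:
require 'open3'

html = '<h4>Problems:</h4><ul><li>list item 1</li><li>list item 2</li></ul>'

# Feed the HTML to pandoc on stdin and capture the Textile it writes to stdout.
textile, status = Open3.capture2('pandoc', '-f', 'html', '-t', 'textile', stdin_data: html)
puts textile if status.success?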

Related

Page Object Gem. Making accessors for elements without ids

Let's say I have a simple page that has fewer IDs than I'd like for testing:
<div class="__panel_body">
<div class="__panel_header">Real Estate Rating</div>
<div class="__panel_body">
<div class="__panel_header">Property Rating Info</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">General Risks</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">Amenities</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
</div>
I'm using Jeff Morgan's Page Object gem and I want to make accessors for the edit links in any given section.
The challenge is that the panel headers are what differentiate which body I want to choose; from there I need to access the parent and get all links with class "icon.edit". Assume I can't change the HTML to solve this.
Here's a start
module RealEstateRatingPageFields
div(:general_risks_section, ....)
def general_risks_edit_links
general_risks_section_element.links(class: "icon.edit")
end
end
How do I get the general_risks_section accessor to work, though?
I want it to represent the parent div of the panel header with text 'General Risks'...
There are a number of ways to get the general risk section.
Using a Block
The accessors can take a block where you can more programmatically describe how to locate the element. This allows you to locate a distinguishing element and then traverse the DOM to the element you actually want. In this case, you can locate the header with the matching text and navigate to its parent.
div(:general_risks_section) { div_element(class: '__panel_header', text: 'General Risks').parent }
Using XPath
While harder to read and write, you could also use an XPath locator. The concept and thought process is the same as using the block. The only benefit is that it reduces the number of element calls, which slightly improves performance.
div(:general_risks_section, xpath: './/div[@class="__panel_body"][./div[@class="__panel_header" and text() = "General Risks"]]')
The XPath is saying:
.//div                            # Find a div element that
  [@class="__panel_body"]         # Has the class "__panel_body" and
  [./div[                         # Contains a div element that
    @class="__panel_header" and   # Has the class "__panel_header" and
    text() = "General Risks"      # Has the text "General Risks"
  ]]
Using the Body Text
Given the HTML, you could also just locate the section directly based on its text.
div(:general_risks_section, class: '__panel_body', text: 'General Risks')
Note that this assumes that the HTML given was not simplified. If there are actually other text nodes, this probably would not be the best option.

Get content after header tag with Nokogiri

I am playing with Nokogiri just to learn it and am trying to write a little CL scraper. Right now I am trying to match up each State on the main page with the cities underneath. Below is a snippet of the HTML:
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
I can already pull out just the div with class "colmask" easily enough, but now I'm trying to get the <ul> directly after each <h4> and can't find a way to do it so far. Suggestions?
You can get ul elements after h4 using following-sibling:
require 'nokogiri'
html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
EOF
doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
  puts node.to_html
end
To select ul after an h4 with exact text:
puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map { |state|
  cities = state.next_element.search('li')
  [state.text, cities.map(&:text)]
}.to_h
At this point states_and_cities is a hash of arrays:
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
If you're concerned about having a big structure, it'd be very easy to convert states to a hash where each state's name is a key, and the associated value is the state's node. Then, that node could be grabbed to find only the cities for the particular state.
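For instance, a quick sketch of that approach, reusing the doc parsed above (the cities_for lambda is just for illustration):
# Map each state's name to its <h4> node, then pull cities only when needed.
state_nodes = doc.search('h4').map { |h4| [h4.text, h4] }.to_h

cities_for = ->(name) { state_nodes[name].next_element.search('li').map(&:text) }

cities_for.call('Alaska')
# => ["anchorage / mat-su", "fairbanks"]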
However, if you're running this code to generate content for a web-page on the fly, then you're going about it wrong. The information for states and cities should be dumped into a database where it can be accessed much more quickly. Then you won't have to do it every time the page is generated.
Being kind and gentle to other sites is important: research the HEAD HTTP request; it's your key to determining whether you should retrieve a page in full. Learn how to sniff the cache information in the HTTP headers returned by a server, which tells you what your minimum refresh rate should be. And pay attention to the robots.txt file, which tells you what the site considers safe for you to scrape; ignoring it can get you banned.

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each section header. However, the designer of the webpage inconveniently decided to break the data up into three columns, adding an additional div (with classes like column, column right, etc.).
My current method of extraction uses XPath.
For the section headers, I use this XPath (get all h1 elements within the div with the given id):
//div[@id="relevantID"]//h1
That returns a list of h1 elements. Looping over each matched h1 element, I apply an additional selector to look up the next ul node and retrieve all its li nodes:
following-sibling::ul//li
But thanks to the designer's aesthetics, I'm failing in the one particular case I've marked in the HTML, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the html to make a selector match is considered good (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary; hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
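As a quick sanity check, here's that expression run through Nokogiri against a trimmed-down copy of the markup (most items are omitted, since only the structure matters):
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1c</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li></ul>
  </div>
</div>
EOT

# All <li> elements that fall between the two headers in document order,
# regardless of which column <div> they sit in.
items = doc.xpath(
  '//div[@id="relevantID"]//li[' \
  'normalize-space(preceding::h1) = "Section-Header-1" and ' \
  'normalize-space(following::h1) = "Section-Header-2"]'
)
p items.map(&:text)
# => ["item1a", "item1b", "item1c"]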
You can combine following-sibling and preceding-sibling with the union operator | to also pick up li elements that appear in a div before the h1. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 elements using //div[@id="relevantID"]//h1 and then, as a second step, retrieving all li items for each h1 using following-sibling::ul//li, you could combine this into following-sibling::ul//li | preceding-sibling::ul//li.

Returning a list of <li> WebElements via find_element_by_xpath

So I am using a combination of Selenium and Python 2.7 (and if it matters the browser I am in is Firefox). I am new to XPath but it seems very useful for fetching WebElements.
I have the following HTML file that I am parsing through:
<html>
<head></head>
<body>
..
<div id="childItem">
<ul>
<li class="listItem"><img/><span>text1</span></li>
<li class="listItem"><img/><span>text2</span></li>
...
<li class="listItem"><img/><span>textN</span></li>
</ul>
</div>
</body>
</html>
Now I can use the following code to get a list of all the li elements:
root = element.find_element_by_xpath('..')
child = root.find_element_by_id('childDiv')
list = child.find_elements_by_css_selector('div.childDiv > ul > li.listItem')
I am wondering how I can do this in an XPath statement. I have tried a few statements, but the simplest is:
list = child.find_element_by_xpath('li[@class="listItem"]')
But I always end up getting the error:
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"li[@class=\\"listItem\\"]"}';
As I do have a workaround (the first three lines), this is not critical for me, but I would like to know what I am doing wrong.
You are missing the .// at the start of the xpath:
list = child.find_element_by_xpath('.//li[@class="listItem"]')
The .// means to search anywhere within the child element.

Building an HTML document with content from another

I have a document A and want to build a new one, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building document B, I'm using the method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags, like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
  ...
  doc.h1 a.at_css("#section#{n} h1").text
  ...
end
So there are three questions:
1. How do I grab the content of <p> tags, preserving tags inside <p>? Currently, once I hit a.at_css("#section#{n} p").text, it returns plain text, which is not what's needed. If, instead of .text, I hit .to_html or .inner_html, the HTML appears escaped, so I get, for example, &lt;p&gt; instead of <p>.
2. Is there any known true way of assigning nodes at the document-building stage, so that I wouldn't dance with the text method at all? I.e. how do I assign the doc.h1 node the value of the a.at_css("#section#{n} h1") node at the building stage?
3. What's the profit of the Nokogiri::Builder.with(...) method? I wonder if I can get use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html; the markup is not escaped when you access it that way. It will be escaped if you pass it as a string argument, e.g. builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
  d.body do
    d.div(id:'content') do
      d.parent << para.inner_html
    end
  end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that, you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much of your document will be hard-coded nodes/hierarchy versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?
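In case it helps, here's a bare-bones sketch of the DOM-methods route mentioned above; the selectors just mirror the sample document, and nothing here is specific to your real data:
require 'nokogiri'

a = Nokogiri::HTML('<div id="section0"><h1>Section 0</h1><p>Some <b>important</b> info here</p></div>')

# Build B as an ordinary document and graft copies of A's nodes into it.
b = Nokogiri::HTML('<html><body></body></html>')
body = b.at('body')
body.add_child(a.at_css('#section0 h1').dup)  # dup so A is left untouched
body.add_child(a.at_css('#section0 p').dup)   # the inner <b> markup comes along intact

puts b.to_html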
