I'm trying to work out exact XPath query expressions so I can data-mine a site with RapidMiner.
I need a query to isolate each line individually:
Wed 7/11/2012
TROLL
9999999999999
07.11.12
CONNOTE FILE LODGED
Tue 20/11/2012 1:12 PM
So far all I have is //td[@class='select']/text()
Note: The values will change so the query needs to be location specific.
What would the six separate queries be for each of the values?
<tr>
<td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
Wed 7/11/2012<br>
TROLL
</td>
<td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
9999999999999
<br>07.11.12
</td>
<td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
CONNOTE FILE LODGED <br>
Tue 20/11/2012 1:12 PM
</td>
</tr>
</table>
Using the Ruby library Nokogiri (which stands on top of libxml2, implementing XPath 1.0) to test:
XPATHS = %w{
//tr/td[1]/text()[1]
//tr/td[1]/text()[2]
//tr/td[2]/text()[1]
//tr/td[2]/text()[2]
//tr/td[3]/text()[1]
//tr/td[3]/text()[2]
}
require 'nokogiri'
d = Nokogiri.HTML(html)
XPATHS.each{ |expression| p d.at_xpath(expression).content }
#=> "\n Wed 7/11/2012"
#=> "\n TROLL\u00A0\n\n "
#=> "\n 9999999999999\n "
#=> "07.11.12\n\n \u00A0\n "
#=> "\n\n\n\n\n CONNOTE FILE LODGED "
#=> "\n Tue 20/11/2012 1:12 PM\n \u00A0\n\n\n\n\u00A0\n "
As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you probably want to strip out. We can strip this by using normalize-space:
XPATHS = %w{
normalize-space(//tr/td[1]/text()[1])
normalize-space(//tr/td[1]/text()[2])
normalize-space(//tr/td[2]/text()[1])
normalize-space(//tr/td[2]/text()[2])
normalize-space(//tr/td[3]/text()[1])
normalize-space(//tr/td[3]/text()[2])
}
XPATHS.each{ |expression| p d.xpath(expression) }
#=> "Wed 7/11/2012"
#=> "TROLL\u00A0"
#=> "9999999999999"
#=> "07.11.12 \u00A0"
#=> "CONNOTE FILE LODGED"
#=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
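The \u00A0 characters that remain are non-breaking spaces, which normalize-space in XPath 1.0 does not count as whitespace. Here is a minimal sketch of one way to clean them up on the Ruby side after extraction. The helper name clean_text is made up for illustration and is not part of Nokogiri:

```ruby
# Hypothetical helper (not part of Nokogiri): collapse every run of
# whitespace, including non-breaking spaces (U+00A0), into one space,
# then trim both ends.
def clean_text(str)
  str.gsub(/[\s\u00A0]+/, ' ').strip
end

p clean_text("TROLL\u00A0")
#=> "TROLL"
p clean_text("Tue 20/11/2012 1:12 PM \u00A0 \u00A0")
#=> "Tue 20/11/2012 1:12 PM"
```

Because the cleanup happens in Ruby, it works the same whether the string came from text() or from normalize-space().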
I am trying to scan rows in a HTML table using partial href xpath and perform further tests with that row's other column values.
<div id = "blah">
<table>
<tr>
<td>link</td>
<td>29 33 485</td>
<td>45.2934,00 EUR</td>
</tr>
<tr>
<td>link</td>
<td>22 93 485</td>
<td>38.336.934,123 EUR</td>
</tr>
<tr>
<td>link</td>
<td>394 27 3844</td>
<td>3.485,2839 EUR</td>
</tr>
</table>
</div>
In a cucumber-jvm step definition I did this quite easily, as shown below (I am more comfortable using Ruby):
@Given("^if there are...$")
public void if_there_are...() throws Throwable {
...
...
baseTable = driver.findElement(By.id("blah"));
tblRows = baseTable.findElements(By.tagName("tr"));
for(WebElement row : tblRows) {
if (row.findElements(By.xpath(".//a[contains(@href,'key=HONDA')]")).size() > 0) {
List<WebElement> col = row.findElements(By.tagName("td"));
tblData dummyThing = new tblData();
dummyThing.col1 = col.get(0).getText();
dummyThing.col2 = col.get(1).getText();
dummyThing.col3 = col.get(2).getText();
dummyThing.col4 = col.get(3).getText();
dummyThings.add(dummyThing);
}
}
I am clueless here
page.find('#blah').all('tr').each { |row|
# if row matches xpath then grab that complete row
# so that other column values can be verified
# I am clueless from here
row.find('td').each do { |c|
}
page.find('#blah').all('tr').find(:xpath, ".//a[contains(@href,'key=HONDA')]").each { |r|
#we got the row that matches xpath, let us do something
}
}
I think you are looking to do:
page.all('#blah tr').each do |tr|
next unless tr.has_selector?('a[href*="HONDA"]')
# Do stuff with trs that meet the href requirement
puts tr.text
end
#=> link 29 33 485 45.2934,00 EUR
#=> link 22 93 485 38.336.934,123 EUR
This basically says to:
Find all trs in the element with id 'blah'
Iterate through each of the trs
If the tr does not have a link that has a href containing HONDA, ignore it
Otherwise, output the text of the row (that matches the criteria). You could do whatever you need with the tr here.
You could also use xpath to collapse the above into a single statement. However, I do not think it is as readable:
page.all(:xpath, '//div[@id="blah"]//tr[.//a[contains(@href, "HONDA")]]').each do |tr|
# Do stuff with trs that meet the href requirement
puts tr.text
end
#=> link 29 33 485 45.2934,00 EUR
#=> link 22 93 485 38.336.934,123 EUR
Here is an example of how to inspect each matching row's link url and column values:
page.all('#blah tr').each do |tr|
next unless tr.has_selector?('a[href*="HONDA"]')
# Do stuff with trs that meet the href requirement
href = tr.find('a')['href']
column_value_1 = tr.all('td')[1].text
column_value_2 = tr.all('td')[2].text
puts href, column_value_1, column_value_2
end
#=> file:///C:/Scripts/Misc/Programming/Capybara/afile?key=HONDA
#=> 29 33 485
#=> 45.2934,00 EUR
#=> file:///C:/Scripts/Misc/Programming/Capybara/afile?key=HONDA
#=> 22 93 485
#=> 38.336.934,123 EUR
If you need the table row, you could probably use something like the ancestor method:
anchors = page.all('#blah a[href*="HONDA"]')
trs = anchors.map { |anchor| anchor.ancestor('tr') }
Given an html file:
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://www.site.com/data/1">1</a>
<a href="http://www.site.com/data/2">2</a>
</span>
</div>
...more divs
<div class="NormalMid">
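<span class="style-span">
"Data 20:"
<a href="http://www.site.com/data/20">20</a>
<a href="http://www.site.com/data/21">21</a>
<a href="http://www.site.com/data/22">22</a>
<a href="http://www.site.com/data/23">23</a>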
</span>
</div>
...more divs
</div>
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
data = row.children[0].children[0].to_s.strip
links = link.collect {|item| item.at_xpath('@href').to_s.strip}
detail[data.to_sym] = links
end
detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Because the argument part in the argument list is not being used, I figured I didn't need it there. But my program doesn't work correctly if I delete the corresponding row.children.first.element_children as well.
What is going on in the [[obj,obj],].each do block? I just started Ruby a week ago and I'm still getting used to the syntax, so any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] will have the output
#<Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648
name="href" value="http://www.site.com/data/1">] children=[#<Nokogiri::XML::Text:0xcea1a4
"1">]>
puts rows[0].children.first.element_children[0]
1
Your code is more complicated than it needs to be. Looking at it, it seems you are trying to get something like the following:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
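<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://site.com/data/1">1</a>
<a href="http://site.com/data/2">2</a>
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://site.com/data/20">20</a>
<a href="http://site.com/data/21">21</a>
<a href="http://site.com/data/22">22</a>
<a href="http://site.com/data/23">23</a>
</span>
</div>
</div>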
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
val = rows.map do |row|
[row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//@href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Look at the two cases below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
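Applied to the question: with a single block parameter, link receives the whole inner array (a one-element Array wrapping the node set), so collect runs only once, and as far as I can tell at_xpath called on a node set returns just its first match. That would explain why only the first href shows up. Here is a small sketch with a plain Array standing in for the Nokogiri node set; the hrefs values and the block_args helper are invented for illustration:

```ruby
# Stand-in for row.children.first.element_children (an assumption for
# illustration; the real object is a Nokogiri node set, not an Array).
hrefs = ['data/1', 'data/2']

# Records what the block parameter named `link` receives in each form.
def block_args(hrefs)
  seen = {}
  # Two parameters: the inner array is unpacked, link is its second element.
  [[hrefs, hrefs]].each { |_part, link| seen[:two_params] = link }
  # One parameter: nothing is unpacked, link is the whole inner array.
  [[hrefs]].each { |link| seen[:one_param] = link }
  # Parentheses force unpacking even with a single name.
  [[hrefs]].each { |(link)| seen[:unpacked] = link }
  seen
end

result = block_args(hrefs)
p result[:two_params]  #=> ["data/1", "data/2"]
p result[:one_param]   #=> [["data/1", "data/2"]]
p result[:unpacked]    #=> ["data/1", "data/2"]
```

So if you drop part, either write the block parameter as |(link)| or skip the wrapping array entirely and call collect on the node set directly.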
This might be merely a syntax question.
I am unclear how to match only table rows whose id begins with rowId_
agent = Mechanize.new
pageC1 = agent.get("/customStrategyScreener!list.action")
The table has class=tableCellDT.
pageC1.search('table.tableCellDT tr[@id=rowId_]') # parses OK but returns 0 rows since rowId_ is not matched exactly.
pageC1.search('table.tableCellDT tr[@id=rowId_*]') # Throws an error since * is not treated like a wildcard string match
EXAMPLE HTML:
<table id="row" cellpadding="5" class="tableCellDT" cellspacing="1">
<thead>
<tr>
<th class="tableHeaderDT">#</th>
<th class="tableHeaderDT sortable">
Screener</th>
<th class="tableHeaderDT sortable">
Strategy</th>
<th class="tableHeaderDT"> </th></tr></thead>
<tbody>
<tr id="rowId_BullPut" class="odd">
<td> 1 </td>
<td> Bull</td>
<td></td>
<td>Edit
Delete
View
</td></tr>
NOTE
pageC1 is a Mechanize::Page object, not a Nokogiri anything. Sorry that wasn't clear at first.
Mechanize::Page doesn't have #css or #xpath methods, but a Nokogiri doc can be extracted from it (used internally anyway).
To get the tr elements that have an id starting with "rowId_":
pageC1.search('//tr[starts-with(@id, "rowId_")]')
You want either the CSS3 attribute starts-with selector:
pageC1.css('table.tableCellDT tr[id^="rowId_"]')
or the XPath starts-with() function:
pageC1.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]')
Although the Nokogiri Node#search method will intelligently pick between CSS or XPath selector syntax based on what you wrote, that does not mean that you can mix both CSS and XPath selector syntax in the same query.
In action:
>> require 'nokogiri'
#=> true
>> doc = Nokogiri.HTML <<ENDHTML; true #hide output from IRB
">> <table class="foo"><tr id="rowId_nonono"><td>Nope</td></tr></table>
">> <table class="tableCellDT">
">> <tr id="rowId_yesyes"><td>Yes1</td></tr>
">> <tr id="rowId_andme2"><td>Yes2</td></tr>
">> <tr id="rowIdNONONO"><td>Needs underscore</td></tr>
">> </table>
">> ENDHTML
#=> true
>> doc.css('table.tableCellDT tr[id^="rowId_"]').map(&:text)
#=> ["Yes1", "Yes2"]
>> doc.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]').map(&:text)
#=> ["Yes1", "Yes2"]
Thanks to
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-css
and the answers above, here is the final code that solves my problem of getting just the rows I need, and then reading only certain information from each one:
pageC1.search('//tr[starts-with(@id, "rowId_")]').each do |row|
# Read the string after _ in rowId_, part of the "id" in <tr>
rid = row.attribute("id").text.split("_")[1] # => "BullPut"
# Get the URL of the 3rd <a> link in <td> cell 4
link = row.css("td[4] a[3]")[0].attributes["href"].text # => "link3?model.itemId=2262&model.source=list"
end
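One thing to watch for: split("_")[1] keeps only the segment between the first and second underscore, so an id such as rowId_Bull_Put would yield "Bull". Here is a small sketch, assuming such ids can occur, that passes a limit to split so everything after the first underscore is kept. The helper name row_suffix and the second sample id are invented:

```ruby
# Return everything after the first underscore of a row id.
# The limit of 2 stops split from cutting at later underscores.
def row_suffix(id)
  id.split('_', 2)[1]
end

p row_suffix('rowId_BullPut')   #=> "BullPut"
p row_suffix('rowId_Bull_Put')  #=> "Bull_Put"
```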
I have an html like this:
...
<table>
<tbody>
...
<tr>
<th> head </th>
<td> td1 text</td>
<td> td2 text</td>
...
</tr>
</tbody>
<tfoot>
</tfoot>
</table>
...
I'm using Nokogiri with Ruby. I want to traverse each row and collect the text of the th and the corresponding td into a hash.
require "nokogiri"
#Parses your HTML input
html_data = "...stripped HTML markup code..."
html_doc = Nokogiri::HTML html_data
#Iterates over each row in your table
#Note that you may need to clarify the CSS selector below
result = html_doc.css("table tr").inject({}) do |all, row|
#Modify if you need to collect only the first td, for example
all[row.css("th").text] = row.css("td").text
#Return the hash so inject passes it on to the next iteration
all
end
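Keep in mind that with inject the block's return value becomes the accumulator for the next iteration, so the hash itself has to be the last expression in the block. each_with_object sidesteps that, because it always passes the same object along. Here is a sketch with plain [header, cell] pairs standing in for the parsed rows. The pairs and the rows_to_hash name are assumptions for illustration, not Nokogiri output:

```ruby
# Plain [header, cell] pairs stand in for the parsed rows (an assumption
# for illustration; real code would read them from each tr element).
def rows_to_hash(rows)
  rows.each_with_object({}) do |(header, cell), all|
    # The block's return value is ignored; `all` is the same hash every time.
    all[header.strip] = cell.strip
  end
end

p rows_to_hash([[' head ', ' td1 text '], [' other ', ' td2 text ']])
```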
I didn't run this code, so I'm not absolutely sure but the overall idea should be right:
html_doc = Nokogiri::HTML("<html> ... </html>")
result = []
html_doc.xpath("//tr").each do |tr|
hash = {}
tr.children.each do |node|
hash[node.node_name] = node.content
end
result << hash
end
puts result.inspect
See the docs for more info: http://nokogiri.org/Nokogiri/XML/Node.html
I have a text blob field in a MySQL column that contains HTML. I have to change some of the markup, so I figured I'll do it in a ruby script. Ruby is irrelevant here, but it would be nice to see an answer with it. The markup looks like the following:
<h5>foo</h5>
<table>
<tbody>
</tbody>
</table>
<h5>bar</h5>
<table>
<tbody>
</tbody>
</table>
<h5>meow</h5>
<table>
<tbody>
</tbody>
</table>
I need to change just the first <h5>foo</h5> block of each text to <h2>something_else</h2> while leaving the rest of the string alone.
Can't seem to get the proper PCRE regex, using Ruby.
# The regex literal syntax using %r{...} allows / in your regex without escaping
new_str = my_str.sub( %r{<h5>[^<]+</h5>}, '<h2>something_else</h2>' )
Using String#sub instead of String#gsub causes only the first replacement to occur. If you need to dynamically choose what 'foo' is, you can use string interpolation in regex literals:
new_str = my_str.sub( %r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>" )
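One caveat with interpolation: characters such as . or * or ( inside searchstr are treated as regex syntax. Here is a short sketch, assuming the search text should be matched literally, that wraps it in Regexp.escape. The method name retitle_first and the sample strings are made up:

```ruby
# Replace only the first <h5> whose text is exactly `searchstr`,
# treating `searchstr` as literal text rather than as a pattern.
def retitle_first(html, searchstr, replacestr)
  html.sub(%r{<h5>#{Regexp.escape(searchstr)}</h5>}, "<h2>#{replacestr}</h2>")
end

html = "<h5>fooXbar</h5><h5>foo.bar</h5>"
puts retitle_first(html, 'foo.bar', 'something_else')
#=> <h5>fooXbar</h5><h2>something_else</h2>
```

Without the escape, the dot would match any character and the first heading (fooXbar) would be replaced instead.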
Then again, if you know what 'foo' is, you don't need a regex:
new_str = my_str.sub( "<h5>searchstr</h5>", "<h2>#{replacestr}</h2>" )
or even:
my_str[ "<h5>searchstr</h5>" ] = "<h2>#{replacestr}</h2>"
If you need to run code to figure out the replacement, you can use the block form of sub:
new_str = my_str.sub %r{<h5>([^<]+)</h5>} do |full_match|
# The expression returned from this block will be used as the replacement string
# $1 will be the matched content between the h5 tags.
"<h2>#{replacestr}</h2>"
end
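For instance, here is a sketch in which the replacement is derived from the captured heading text. Upcasing is only a placeholder for whatever logic you actually need, and the method name promote_first_heading is invented:

```ruby
# Turn the first <h5> into an <h2>, building the new text from the old one.
def promote_first_heading(html)
  html.sub(%r{<h5>([^<]+)</h5>}) do
    # $1 holds the text captured between the h5 tags.
    "<h2>#{$1.upcase}</h2>"
  end
end

puts promote_first_heading("<h5>foo</h5><h5>bar</h5>")
#=> <h2>FOO</h2><h5>bar</h5>
```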
Whenever I have to parse or modify HTML or XML I reach for a parser. I almost never bother with regex or instring unless it's absolutely a no-brainer.
Here's how to do it using Nokogiri, without any regex:
text = <<EOT
<h5>foo</h5>
<table>
<tbody>
</tbody>
</table>
<h5>bar</h5>
<table>
<tbody>
</tbody>
</table>
<h5>meow</h5>
<table>
<tbody>
</tbody>
</table>
EOT
require 'nokogiri'
fragment = Nokogiri::HTML::DocumentFragment.parse(text)
print fragment.to_html
fragment.css('h5').select{ |n| n.text == 'foo' }.each do |n|
n.name = 'h2'
n.content = 'something_else'
end
print fragment.to_html
After parsing, this is what Nokogiri has returned from the fragment:
# >> <h5>foo</h5>
# >> <table><tbody></tbody></table><h5>bar</h5>
# >> <table><tbody></tbody></table><h5>meow</h5>
# >> <table><tbody></tbody></table>
This is after running:
# >> <h2>something_else</h2>
# >> <table><tbody></tbody></table><h5>bar</h5>
# >> <table><tbody></tbody></table><h5>meow</h5>
# >> <table><tbody></tbody></table>
Use String#sub with the regular expression <h5>[^<]+<\/h5> (String#gsub would replace every <h5> block, not just the first):
>> current = "<h5>foo</h5>\n <table>\n <tbody>\n </tbody>\n </table>"
>> updated = current.sub(/<h5>[^<]+<\/h5>/){"<h2>something_else</h2>"}
=> "<h2>something_else</h2>\n <table>\n <tbody>\n </tbody>\n </table>"
Note: you can test Ruby regular expressions comfortably in your browser.