Currently I'm looping over table rows and getting values from a td, putting them in a sorted hash identified by a value in a sibling td:
Ruby snippet
@counts = Hash.new
agent.page.search('.child').each do |child|
  # strip (not strip!) because strip! returns nil when there is nothing to remove
  @counts[child.css('td')[0].text.strip] = child.css('td')[1].text.gsub(/,/, '').to_i
end
puts @counts.sort_by { |k, v| v }.reverse.to_h
HTML structure
<tr class="parent">
<td class="info">Type</td>
<td>12,000</td>
</tr>
<tr class="child">
<td class="info">Sub Type</td>
<td>9,000</td>
</tr>
<tr class="child">
<td class="info">Sub Type</td>
<td>3,000</td>
</tr>
<tr class="parent">
<td class="info">Type</td>
<td>11,000</td>
</tr>
<tr class="child">
<td class="info">Sub Type</td>
<td>11,000</td>
</tr>
Now I would like to change the hash keys, by concatenating them with the text value in a td belonging to the parent tr. So in the above HTML structure, instead of "Sub Type" => 9000, "Sub Type" => 3000, etc. I would like to get "Type Sub Type" => 9000, "Type Sub Type" => 3000, etc.
How do I get the first preceding sibling with a certain class, when the number of siblings is unknown?
You can look at this a different way: loop through all tr elements (parent and child), remember the last parent type found, and prepend it to the key when you get to a child.
@counts = Hash.new
parent = nil
agent.page.search('.parent, .child').each do |node|
  type = node.css('td')[0].text.strip
  value = node.css('td')[1].text.gsub(/,/, '').to_i
  if node['class'].include? 'parent'
    parent = type
  else
    @counts["#{parent} #{type}"] = value
  end
end
puts @counts.sort_by { |k, v| v }.reverse.to_h
Also, a hash is not really meant to be a sorted data structure (Ruby hashes only remember insertion order), and identical keys overwrite each other. If you want to retain order, then your best bet would be an array of tuples. In other words, [['Type Sub Type', 12000], ['Type Sub Type', 11000], ..., ['Type Sub Type', 3000]]. Just remove the .to_h at the end of your last line to get that kind of result.
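A small sketch of the difference, using hypothetical pairs rather than scraped data: sorting the pairs keeps every entry, while converting them to a hash keeps only the last value stored under each repeated key.

```ruby
# Hypothetical pairs, shaped like the ones the scraping loop would collect.
pairs = [['Type Sub Type', 9000], ['Type Sub Type', 3000], ['Type Sub Type', 11000]]

# Sort descending by value; the result is still an array of [key, value] pairs.
sorted = pairs.sort_by { |_key, value| -value }

# Converting to a hash collapses identical keys, the last pair wins.
as_hash = sorted.to_h

p sorted
p as_hash
```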
Is there a way to print "No URL Found" for a blank or missing anchor tag?
The reason for asking is that the text nodes produce 50 results but the URLs only 47, because some of the anchors are missing or not available. That throws off the rest of the list and completely ruins it.
I can get the text node and the attributes; the only problem is that some of the td cells have no anchor, which makes the rest of the list fall out of step.
<table>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode With No Anchor</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode With No Anchor</td>
</tr>
</table>
company_name = page.css("td:nth-child(2)")
company_name.each do |line|
  c_name = line.text.strip
  # this will output 50 titles
  puts c_name
end
directory_url = page.css("td:nth-child(1) a")
directory_url.each do |line|
  dir_url = line["href"]
  # this will output only 47 URLs, since some cells have no anchor tag
  puts dir_url
end
You can't find things that aren't there. You have to find things that are there, and then search within them for elements that may or may not be present.
Like:
directory = page.css("td:nth-child(1)")
directory.each do |e|
  anchor = e.css('a')
  puts anchor.any? ? anchor[0]['href'] : '(No URL)'
end
How do you traverse up to a certain found element and then continue to the next found item? In my example I am trying to search for the first <font> element, grab the text, and then continue until I find the next <font> tag or until I hit an <img> tag. The reason I also need to take the <img> tag into account is that I want to do something there.
HTML
<table border=0>
<tr>
<td width=180>
<font size=+1><b>apple</b></font>
</td>
<td>Description of an apple</td>
</tr>
<tr>
<td width=180>
<font size=+1><b>banana</b></font>
</td>
<td>Description of a banana</td>
</tr>
<tr>
<td><img vspace=4 hspace=0 src="common/dot_clear.gif"></td>
</tr>
...Then this repeats itself in a similar format
Current scrape.rb
#...
document.at_css("body").traverse do |node|
#if <font> is found
#puts text in font
#else if <img> is found then
#puts img src and continue loop until end of document
end
Thank you!
Interesting. You basically want to traverse all the descendants in your tree and perform some operation depending on which node you get.
So here is how we can do that:
# Acquire a dummy page (open-uri provides URI.open for URLs)
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Ruby_%28programming_language%29'))
Now, if you want to start traversing all body elements, XPath comes to the rescue. The expression //body//* returns every descendant element of body (children, grandchildren, and so on).
This returns a Nokogiri::XML::NodeSet whose items are Nokogiri::XML::Element objects:
page.xpath('//body//*')
page.xpath('//body//*').first.node_name
#=> "div"
So you can now iterate over that node set and perform your operations:
page.xpath('//body//*').each do |node|
  case node.name
  when 'div' then # do this
  when 'font' then # do that
  end
end
Something like this perhaps:
document.at_css("body").traverse do |node|
  if node.name == 'font'
    puts node.content
  elsif node.name == 'img'
    puts node['src']
  end
end
This might be merely a syntax question.
I am unclear how to match only table rows whose id begins with rowId_
agent = Mechanize.new
pageC1 = agent.get("/customStrategyScreener!list.action")
The table has class=tableCellDT.
pageC1.search('table.tableCellDT tr[@id=rowId_]') # parses OK but returns 0 rows, since the id would have to match rowId_ exactly
pageC1.search('table.tableCellDT tr[@id=rowId_*]') # throws an error, since * is not treated as a wildcard string match
EXAMPLE HTML:
<table id="row" cellpadding="5" class="tableCellDT" cellspacing="1">
<thead>
<tr>
<th class="tableHeaderDT">#</th>
<th class="tableHeaderDT sortable">
Screener</th>
<th class="tableHeaderDT sortable">
Strategy</th>
<th class="tableHeaderDT"> </th></tr></thead>
<tbody>
<tr id="rowId_BullPut" class="odd">
<td> 1 </td>
<td> Bull</td>
<td></td>
<td>Edit
Delete
View
</td></tr>
NOTE
pageC1 is a Mechanize::Page object, not a Nokogiri anything. Sorry that wasn't clear at first.
Mechanize::Page doesn't have #css or #xpath methods, but a Nokogiri doc can be extracted from it (used internally anyway).
To get the tr elements that have an id starting with "rowId_":
pageC1.search('//tr[starts-with(@id, "rowId_")]')
You want either the CSS3 attribute starts-with selector:
pageC1.css('table.tableCellDT tr[id^="rowId_"]')
or the XPath starts-with() function:
pageC1.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]')
Although the Nokogiri Node#search method will intelligently pick between CSS or XPath selector syntax based on what you wrote, that does not mean that you can mix both CSS and XPath selector syntax in the same query.
In action:
>> require 'nokogiri'
#=> true
>> doc = Nokogiri.HTML <<ENDHTML; true #hide output from IRB
">> <table class="foo"><tr id="rowId_nonono"><td>Nope</td></tr></table>
">> <table class="tableCellDT">
">> <tr id="rowId_yesyes"><td>Yes1</td></tr>
">> <tr id="rowId_andme2"><td>Yes2</td></tr>
">> <tr id="rowIdNONONO"><td>Needs underscore</td></tr>
">> </table>
">> ENDHTML
#=> true
>> doc.css('table.tableCellDT tr[id^="rowId_"]').map(&:text)
#=> ["Yes1", "Yes2"]
>> doc.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]').map(&:text)
#=> ["Yes1", "Yes2"]
Thanks to
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-css
and the answers above, here is the final code that solves my problem of getting just the rows I need, and then reading only certain information from each one:
pageC1.search('//tr[starts-with(@id, "rowId_")]').each do |row|
  # Read the string after _ in rowId_, part of the "id" in <tr>
  rid = row.attribute("id").text.split("_")[1] # => "BullPut"
  # Get the URL of the 3rd <a> link in <td> cell 4
  link = row.css("td[4] a[3]")[0].attributes["href"].text # => "link3?model.itemId=2262&model.source=list"
end
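One caveat about split("_")[1]: it only returns the part up to the next underscore. A sketch of the difference, assuming Ruby 2.5 or later for String#delete_prefix; the id "rowId_Bear_Call" is hypothetical, only "rowId_BullPut" appears in the question.

```ruby
# Hypothetical ids; the second one contains an extra underscore.
ids = ['rowId_BullPut', 'rowId_Bear_Call']

# split("_")[1] keeps only the segment between the first and second underscore.
by_split = ids.map { |id| id.split('_')[1] }

# delete_prefix removes the known prefix and keeps everything after it.
by_prefix = ids.map { |id| id.delete_prefix('rowId_') }

p by_split
p by_prefix
```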
I am trying to automate checking a block that appears on the website by comparing its content with a CMS table.
I have managed to automate the block as it appears in the UI, but when I log in as admin and try to save the content of the table into an array by iterating over it, I fail.
<table id="nodequeue-dragdrop" class="nodequeue-dragdrop sticky-enabled tabledrag-processed sticky-table">
<thead class="tableHeader-processed">
<tbody>
<tr class="draggable odd">
<td>
<a class="tabledrag-handle" href="#" title="Drag to re-order">
New Text 1
</td>
<td>
<td>2012-06-06 10:24</td>
<td style="display: none;">
<td>
<td>
<td class="position">1</td>
</tr>
<tr class="draggable even">
<td>
<a class="tabledrag-handle" href="#" title="Drag to re-order">
Text 2
</td>
<td>
<td>2012-06-06 10:29</td>
<td style="display: none;">
<td>
<td>
<td class="position">2</td>
</tr>
<tr class="draggable odd">
<td>
<a class="tabledrag-handle" href="#" title="Drag to re-order">
This is Text 3
</td>
<td>
<td>2012-06-05 12:55</td>
<td style="display: none;">
<td>
<td>
<td class="position">3</td>
</tr>
The code that I am using is
@text = Array.new
x = 1
y = 0
until x == 10
  y = x - 1
  until y == x
    @text[y] = @browser.table(:id,'nodequeue-dragdrop').tbody.row{x}.cell{1}.link(:href =>/car-news/).text
    puts @text[y]
    y = y + 1
  end
  x = x + 1
end
The problem is that the script runs successfully, but even though I have set up an iteration, it only reads the first element and displays its text; it does not go on to the 2nd, 3rd, and so on.
Justin is headed in the right direction with using Ruby's built-in methods for iterating over collections. But consider this: if I am reading your code right, you know you are after the text from specific links, so why iterate over the rows when you could just make a collection of matching links?
link_text_array = Array.new
@browser.table(:id,'nodequeue-dragdrop').links(:href => /car-news/).each do |link|
  link_text_array << link.text
end
There are built-in methods to iterate over the rows/columns. Try this:
table_array = Array.new
table = @browser.table(:id,'nodequeue-dragdrop')
table.rows.each do |row|
  row_array = Array.new
  row.cells.each do |cell|
    row_array << cell.text
  end
  table_array << row_array
end
p table_array # An array of rows, each an array of cell texts (puts would flatten the nesting)
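A plausible reason the original loop keeps returning the first row: in Ruby, row{x} passes a block to row rather than the argument x, and a method is free to ignore a block it does not use. The sketch below shows the effect with FakeTable, a hypothetical stand-in class that is not part of Watir; whether Watir's own row method behaves the same way is an assumption.

```ruby
# FakeTable is a hypothetical stand-in for a table object; it is not a Watir class.
class FakeTable
  def initialize(rows)
    @rows = rows
  end

  # Takes an optional index; any block given by the caller is silently ignored.
  def row(index = 0)
    @rows[index]
  end
end

table = FakeTable.new(['New Text 1', 'Text 2', 'This is Text 3'])

with_block = (0..2).map { |x| table.row { x } } # braces pass a block, not an argument
with_index = (0..2).map { |x| table.row(x) }    # parentheses pass the index

p with_block
p with_index
```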
Found the solution to my problem:
instead of rows{} I used tds{},
i.e. I changed the code to
@text[y] = @browser.table(:id,'nodequeue-dragdrop').tbody.tds{x}.cell{1}.link(:href =>/car-news/).text
It's working as I want it to.
I have HTML like this:
...
<table>
<tbody>
...
<tr>
<th> head </th>
<td> td1 text</td>
<td> td2 text</td>
...
</tr>
</tbody>
<tfoot>
</tfoot>
</table>
...
I'm using Nokogiri with Ruby. I want to traverse each row and get the text of the th and the corresponding td into a hash.
require "nokogiri"
# Parse your HTML input
html_data = "...stripped HTML markup code..."
html_doc = Nokogiri::HTML html_data
# Iterate over each row in your table
# Note that you may need to refine the CSS selector below
result = html_doc.css("table tr").inject({}) do |all, row|
  # Modify if you need to collect only the first td, for example
  all[row.css("th").text] = row.css("td").text
  all # inject needs the accumulator returned from the block
end
I didn't run this code, so I'm not absolutely sure but the overall idea should be right:
html_doc = Nokogiri::HTML("<html> ... </html>")
result = []
html_doc.xpath("//tr").each do |tr|
  hash = {}
  # element_children skips the whitespace text nodes between cells
  tr.element_children.each do |node|
    hash[node.node_name] = node.content
  end
  result << hash
end
puts result.inspect
See the docs for more info: http://nokogiri.org/Nokogiri/XML/Node.html