Get element after another element with Hpricot and Ruby

I have the following HTML:
<ul class="filtering_new" width="50%">
<li class="filter">1</li>
<li class="filter">2</li>
<script>Alert('1');</script>
<li class="filter">3</li>
</ul>
How can I get the li whose inner_html is 3?
I tried this:
page.search("//ul.filtering_new").each do |list|
  puts list.search("li").size
end
where page is the HTML document.
The size is 2, but it should be 3.
I tried following the manual at https://github.com/hpricot/hpricot/wiki/hpricot-challenge
but I cannot even find the <script> element.
list.search("script")
returns nothing.

I don't think you can mix XPath with CSS selectors in a single search expression, which is what your example does. Try either the XPath form:
//ul[@class='filtering_new']
or the CSS form:
ul.filtering_new
inside search.

Most XML/HTML parsing in Ruby uses Nokogiri these days, so I'll recommend that parser. However, both Hpricot and Nokogiri support XPath and CSS, so they are fairly interchangeable.
I'd go about it this way:
html = <<EOT
<ul class="filtering_new" width="50%">
<li class="filter">1</li>
<li class="filter">2</li>
<script>Alert('1');</script>
<li class="filter">3</li>
</ul>
EOT
require 'nokogiri'
doc = Nokogiri::HTML(html)
li = doc.search('//li[@class="filter"]').select{ |n| n.text.to_i == 3 }
li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
That finds the candidate nodes, then returns them as a NodeSet to be iterated over, where they are selected/rejected based on the node's text.
li = doc.search('//li[text() = "3"]')
li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
That offloads more of the comparison to the underlying libxml2 library, where it typically runs faster.

Related

How to get text from list items with Mechanize?

<div class="carstd">
<ul>
<li class="cars">"Car 1"</li>
<li class="cars">"Car 2"</li>
<li class="cars">"Car 3"</li>
<li class="cars">"Car 4"</li>
</ul>
</div>
I want to strip the text from each list item with Mechanize and print it out. I've tried
puts page.at('.cars').text.strip
but it only gets the first item. I've also tried
page.links.each do |x|
  puts x.at('.cars').text.strip
end
But I get an error undefined method 'at' for #<Mechanize::Page::Link:0x007fe7ea847810>.
There are no links there. Links are <a> elements, which Mechanize converts into special Mechanize objects.
You want something like:
page.search('li.cars').text # the text of all the li's mashed together as a string
or
page.search('li.cars').map{|x| x.text} # the text of each `li` as an array of strings

Using Nokogiri to find element before another element

I have a partial HTML document:
<h2>Destinations</h2>
<div>It is nice <b>anywhere</b> but here.
<ul>
<li>Florida</li>
<li>New York</li>
</ul>
<h2>Shopping List</h2>
<ul>
<li>Booze</li>
<li>Bacon</li>
</ul>
For every <li> item, I want to know the category the item is in, i.e., the text in the preceding <h2> tag.
This code does not work, but this is what I'm trying to do:
@page.search('li').each do |li|
  li.previous('h2').text
end
Nokogiri allows you to use XPath expressions to locate an element:
categories = []
doc.xpath("//li").each do |elem|
  categories << elem.parent.xpath("preceding-sibling::h2").last.text
end
categories.uniq!
p categories
The first part finds all li elements. For each one, we go to its parent (ul or ol) and then look for an h2 that comes before it (preceding-sibling). There can be more than one, so we take the last, i.e., the one closest to the current position.
We need to call uniq! because we get the h2 once for each li (the li is the starting point).
Using your own HTML example, this code outputs:
["Destinations", "Shopping List"]
You are close.
@page.search('li').each do |li|
  category = li.xpath('../preceding-sibling::h2').text
  puts "#{li.text}: category #{category}"
end
The code:
categories = []
Nokogiri::HTML("your HTML here").css("h2").each do |category|
  categories << category.text
end
The result:
categories = ["Destinations", "Shopping List"]

Parsing webpage with some html tags using Nokogiri

For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example extracts only the text from .appwindow.
How can I get this text together with its <p> tags?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.

How to select nodes by matching text

If I have a bunch of elements like:
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
Is there a built-in method in Nokogiri that would get me all p elements that contain the text "Apple"? (The example element above would match, for instance).
Nokogiri can do this (now) using jQuery extensions to CSS:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
# => "bar"
Here is an XPath that works:
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
p doc.xpath('//li[contains(text(), "Apple")]')
__END__
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
You can also do this very easily with Nikkou:
doc.search('p').text_includes('bar')
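Try using this XPath (the leading . keeps the inner search relative to each p instead of restarting from the document root):
paragraphs = doc.xpath('//p[.//*[contains(text(), "Apple")]]')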

Best way to parse a table in Ruby

I'd like to parse a simple table into a Ruby data structure. The table looks like this:
(Image of the table: http://img232.imageshack.us/img232/446/picture5cls.png)
Edit: Here is the HTML
and I'd like to parse it into an array of hashes. E.g.,:
schedule[0]['NEW HAVEN'] == '4:12AM'
schedule[0]['Travel Time In Minutes'] == '95'
Any thoughts on how to do this? Perl has HTML::TableExtract, which I think would do the job, but I can't find any similar library for Ruby.
You might like to try Hpricot (gem install hpricot, prepend the usual sudo for *nix systems)
I placed your HTML into input.html, then ran this:
require 'hpricot'
doc = Hpricot.XML(open('input.html'))
table = doc/:table
(table/:tr).each do |row|
  (row/:td).each do |cell|
    puts cell.inner_html
  end
end
which, for the first row, gives me
<span class="black">12:17AM </span>
<span class="black">
</span>
<span class="black">1:22AM </span>
<span class="black">
</span>
<span class="black">65</span>
<span class="black">TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM) </span>
<span class="black">
N
</span>
So already we're down to the content of the TD tags. A little more work and you're about there.
(BTW, the HTML looks a little malformed: you have <th> tags in <tbody>, which seems a bit perverse: <tbody> is fairly pointless if it's just going to be another level within <table>. It makes much more sense if your <tr><th>...</th></tr> stuff is in a separate <thead> section within the table. But it may not be "your" HTML, of course!)
In case there isn't a library to do that for ruby, here's some code to get you started writing this yourself:
require 'nokogiri'
doc=Nokogiri("<table><tr><th>la</th><th><b>lu</b></th></tr><tr><td>lala</td><td>lulu</td></tr><tr><td><b>lila</b></td><td>lolu</td></tr></table>")
header, *rest = (doc/"tr").map do |row|
  row.children.map do |c|
    c.text
  end
end
header.map!(&:to_sym)
item_struct = Struct.new(*header)
table = rest.map do |row|
  item_struct.new(*row)
end
table[1].lu #=> "lolu"
This code is far from perfect, obviously, but it should get you started.
