Say I have:
<div class="amt" id="displayFare-1_69-61-0" style="">
<div class="per">per person</div>
<div class="per" id="showTotalSubIndex-1_69-61-0" style="">Total $334</div>
$293
</div>
I want to grab just the $334. It will always have "Total $" but the id showTotalSubIndex... will be dynamic so I can't use that.
You can use a Nokogiri XPath expression to iterate over all the text nodes inside div elements
and check each string for the 'Total $' prefix, like this:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML.parse( open( "test.xml" ))
doc.xpath("//div/text()").each{ |t|
tmp = t.to_str.strip
puts tmp[7..-1] if tmp.index('Total $') == 0
}
Rather than finding the text:
html = Nokogiri::HTML(html)
html.css("div.amt").children[1].text.gsub(/^Total /, '')
I assume here that the HTML is structured in such a way that the second child of any div.amt tag is the value that you're after, and then we'll just grab the text of that and gsub it.
Both of these work:
require 'nokogiri'
doc = Nokogiri::XML(xml)
doc.search('//div[@id]/text()').select{ |n| n.text['Total'] }.first.text.split.last
and
doc.search('//div/text()').select{ |n| n.text['Total'] }.first.text.split.last
The difference is the first should run a bit faster if you know the div you're looking for always has an id.
If the ID always starts with "showTotalSubIndex" you could use:
doc.search('//div[starts-with(@id,"showTotalSubIndex")]').first.text.split.last
and if you know there's only going to be one in the document, you can use:
doc.at('//div[starts-with(@id,"showTotalSubIndex")]').text.split.last
EDIT:
Ryan posits the idea the XML structure might be consistent. If so:
doc.at('//div[2]').text[/(\$\d+)/, 1]
:-)
The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link
I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo Finance. The problem is that the ebit XPath returns nil. I got the XPath by copying it from the Chrome developer tools.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a placeholder with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally stripping the whitespace using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's selector extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"
When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are special, not just regular XML tags, and ignore them when parsing the node tree? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean: "How are you?" is not the content of the first <br>, as it would be in XML.
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
You must parse this fragment using the HTML parser, as this is obviously not valid XML. With the HTML parser, Nokogiri behaves as you'd expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.
As far as I can remember from doing some HTML parsing last year it'll view them as separate.
EDIT: My bad, I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.
I have this HTML code, which is on a single line:
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
Here is the line-friendly version (which I can't use):
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
And I'm trying to extract just the URLs with this regex:
/<h3 class="r"><a href="(.*)">(.*)<\/a>/
And it returns
www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com"
What can I do to make it stop when it finds a " ?
Sigh. Regex and HTML are such awkward bedfellows:
require 'nokogiri'
html = %q{<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>}
doc = Nokogiri::HTML(html)
puts doc.css('a').map{ |a| a['href'] }
# >> www.google.com
# >> www.google.com
This will find them, whether they are deeply nested or all on one line.
The problem is that * is greedy. Put a question mark after it to make it ungreedy.
Working regex (tested on rubular)
href\=\"(.*?)\"
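To see the difference between the greedy and non-greedy forms in plain Ruby, here is a small sketch. The sample string is an assumption, reconstructed from the regex and output shown in the question.

```ruby
# Assumed single-line input, rebuilt from the question's regex and output.
html = %q{<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>}

# Greedy: .* runs to the LAST double quote on the line.
greedy = html[/href="(.*)"/, 1]

# Non-greedy: .*? stops at the FIRST double quote after href=".
lazy = html.scan(/href="(.*?)"/).flatten

puts greedy
p lazy
```

The greedy capture swallows everything between the first href and the final quote, while the non-greedy scan returns each URL on its own.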
In jQuery it's quite simple, for instance:
$("br").parent().contents().each(function() {
but with Nokogiri and XPath it's not working out quite as well:
var = doc.xpath('//br/following-sibling::text()|//br/preceding-sibling::text()').map do |fruit| fruit.to_s.strip end
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
fruits = doc.xpath('//br/../text()').map { |text| text.content.strip }
p fruits
__END__
<html>
<body>
<div>
apple<br>
banana<br>
cherry<br>
orange<br>
</div>
</body>
I'm not familiar with Nokogiri, but are you trying to find all the children of any element that contains a <br/>? If so, then try:
//*[br]/node()
In any case, using text() will only match text nodes, and not any sibling elements, which may or may not be what you want. If you actually only want text nodes, then
//*[br]/text()
should do the trick.