I have this code, and I need to add a regex ahead of "href=" for integers:
f = File.open("us.html")
doc = Nokogiri::HTML(f)
ans = doc.css('a[href=]')
puts doc
I tried doing:
ans = doc.css('a[href=\d]')
or:
ans = doc.css('a[href="\d"]')
but it doesn't work. Can anyone suggest a workaround?
If you want to use a regular expression, I believe you will have to do that manually. It cannot be done with a CSS or XPath selector.
You can do it by iterating through the elements and comparing their href attribute to your regular expression. For example:
html = %q{
<html>
<a href='1'></a>
<a href='adf'></a>
</html>
}
doc = Nokogiri::HTML(html)
ans = doc.css('a[href]').select{ |e| e['href'] =~ /\d/}
#=>
You can do it in XPath:
require 'nokogiri'
html = %q{
<html>
<a href='1'></a>
<a href='adf'></a>
</html>
}
doc = Nokogiri::HTML(html)
puts doc.xpath('//a[@href[number(.) = .]]')
#=>
The XPath function number() does a conversion to a number. If it equals the node itself, then the node is a number. It is even possible to check a range using inequality operators.
I'm using xpath to get some values on a website like this
auction_page = Nokogiri::HTML open(a, "User-Agent" => theagent)
auction_links = auction_page.xpath('//iframe[contains(@src, "near")]/@src')
Which returns what I need like this
#<Nokogiri::XML::Attr:0x3fcd7bef5730 name="src" value="http://thevalue.com">
I just want to get the value, not the name or anything else. How do I do this?
I think you are looking for the .text method.
So auction_links.text should return "http://thevalue.com".
Edit:
If that doesn't work, try auction_links.first. The xpath call returns a NodeSet (an array-like collection), so .first gives you the attribute node itself, and the link can be read from there. ; )
For further reference, here is a great tutorial for basic Nokogiri Crawling/Parsing.
You could do this as below:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-end
<a id = "foo" class="bar baz" href = "www.test.com"> click here </a>
end
doc.at_xpath("//a[contains(@class,'bar')]/@href").to_s
# => "www.test.com"
So in your case you can write:
auction_page.at_xpath('//iframe[contains(@src, "near")]/@src').to_s
# => "http://thevalue.com"
Is it possible to convert HTML with Nokogiri to plain text? I also want to include <br /> tag.
For example, given this HTML:
<p>ala ma kota</p> <br /> <span>i kot to idiota </span>
I want this output:
ala ma kota
i kot to idiota
When I just call Nokogiri::HTML(my_html).text it excludes <br /> tag:
ala ma kota i kot to idiota
Instead of writing complex regexp I used Nokogiri.
Working solution (K.I.S.S!):
def strip_html(str)
document = Nokogiri::HTML.parse(str)
document.css("br").each { |node| node.replace("\n") }
document.text
end
Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:
require 'nokogiri'
def render_to_ascii(node)
blocks = %w[p div address] # els to put newlines after
swaps = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" } # content to swap out
dup = node.dup # don't munge the original
# Get rid of superfluous whitespace in the source
dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }
# Swap out the swaps
dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }
# Slap a couple newlines after each block level element
dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }
# Return the modified text content
dup.text
end
frag = Nokogiri::HTML.fragment "<p>It is the end of the world
as we
know it<br>and <i>I</i> <strong>feel</strong>
<a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"
puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=>
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?
Try
Nokogiri::HTML(my_html.gsub('<br />',"\n")).text
Nokogiri will strip out links, so I use this first to preserve links in the text version:
html_version.gsub!(/<a href.*(http:[^"']+).*>(.*)<\/a>/i) { "#{$2}\n#{$1}" }
that will turn this:
<a href="http://google.com">link to google</a>
to this:
link to google
http://google.com
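One caveat with the pattern above: its greedy .* can run across several links on the same line, and it only recognizes http: URLs. The following is a sketch of a stricter pattern, not a general HTML parser. The constant name LINK and the sample string are assumptions for illustration.

```ruby
# Hypothetical stricter pattern: stays inside one tag with [^>]*, accepts
# http and https, and uses a non-greedy group for the link text.
LINK = /<a\s[^>]*href=["'](https?:[^"']+)["'][^>]*>(.*?)<\/a>/i

# Sample input with two links on one line (made up for this example).
html_version = 'See <a href="http://google.com">link to google</a> and ' \
               '<a class="x" href="https://example.org/docs">the docs</a>.'

# Replace each anchor with its text followed by its URL on the next line.
text_version = html_version.gsub(LINK) { "#{$2}\n#{$1}" }
puts text_version
```

Each anchor is rewritten separately, so the text between the two links is left untouched.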
If you use HAML, you can handle the HTML conversion by outputting the HTML with the raw helper, for example:
= raw @product.short_description
Using this code:
doc = Nokogiri::HTML(open("text.html"))
doc.xpath("//span[@id='startsWith_']").remove
I would like to select every span whose id starts with 'startsWith_' and remove it. I tried searching, but failed.
Here's an example:
require 'nokogiri'
html = '
<html>
<body>
<span id="doesnt_start_with">foo</span>
<span id="startsWith_bar">bar</span>
</body>
</html>'
doc = Nokogiri::HTML(html)
p doc.search('//span[starts-with(@id, "startsWith_")]').to_xml
That's how to select them.
doc.search('//span[starts-with(@id, "startsWith_")]').each do |n|
n.remove
end
That's how to remove them.
p doc.to_xml
# >> "<span id=\"startsWith_bar\">bar</span>"
# >> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n <span id=\"doesnt_start_with\">foo</span>\n \n</body></html>\n"
The page "XPath, XQuery, and XSLT Functions" has a list of the available functions.
Try this xpath expression:
//span[starts-with(@id, 'startsWith_')]
Say I have:
<div class="amt" id="displayFare-1_69-61-0" style="">
<div class="per">per person</div>
<div class="per" id="showTotalSubIndex-1_69-61-0" style="">Total $334</div>
$293
</div>
I want to grab just the $334. It will always have "Total $" but the id showTotalSubIndex... will be dynamic so I can't use that.
You can use a Nokogiri XPath expression to iterate over the text nodes of all the div elements
and scan each string for the 'Total $' prefix, like this:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML.parse( open( "test.xml" ))
doc.xpath("//div/text()").each{ |t|
tmp = t.to_str.strip
puts tmp[7..-1] if tmp.index('Total $') == 0
}
Rather than finding the text:
html = Nokogiri::HTML(html)
html.css("div.amt").children[1].text.gsub(/^Total /, '')
I assume here that the HTML is structured in such a way that the second child of any div.amt tag is the value that you're after, and then we'll just grab the text of that and gsub it.
Both of these work:
require 'nokogiri'
doc = Nokogiri::XML(xml)
doc.search('//div[@id]/text()').select{ |n| n.text['Total'] }.first.text.split.last
and
doc.search('//div/text()').select{ |n| n.text['Total'] }.first.text.split.last
The difference is the first should run a bit faster if you know the div you're looking for always has an id.
If the ID always starts with "showTotalSubIndex" you could use:
doc.search('//div[starts-with(@id,"showTotalSubIndex")]').first.text.split.last
and if you know there's only going to be one in the document, you can use:
doc.at('//div[starts-with(@id,"showTotalSubIndex")]').text.split.last
EDIT:
Ryan posits the idea the XML structure might be consistent. If so:
doc.at('//div[2]').text[/(\$\d+)/, 1]
:-)
If I have a bunch of elements like:
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
Is there a built-in method in Nokogiri that would get me all p elements that contain the text "Apple"? (The example element above would match, for instance).
Nokogiri can do this (now) using jQuery extensions to CSS:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
# => "bar"
Here is an XPath that works:
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
p doc.xpath('//li[contains(text(), "Apple")]')
__END__
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
You can also do this very easily with Nikkou:
doc.search('p').text_includes('bar')
Try using this XPath:
p = doc.xpath('//p[.//*[contains(text(), "Apple")]]')