So I am using this:
Net::HTTP.get(URI.parse(url))
Works perfectly.
The issue I am having is that the page it returns is a full HTML document, with html, head, body, etc. tags.
There is a label element in the body with an id of "Result". I only want to get back the text of "Result", not all the HTML formatting.
Can this be done?
Well, to get only part of the content of an HTML page you have to use an HTML parser, which in this case will be Nokogiri.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open(url))
doc.css('#Result').each do |re|
  puts re.to_s      # the whole element, markup included
  # puts re.content # just the text inside the element
end
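Since an id should be unique in the document, you can also grab just the first match and its text directly (a minimal sketch; at_css returns the first matching element, or nil if there is none):
result = doc.at_css('#Result')
puts result.text if result  # only the text, no HTML formatting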
Related
I'm trying to parse a raw HTML file using Nokogiri.
require 'nokogiri'
require 'open-uri'

html_file = URI.open(url).read
html_doc = Nokogiri::HTML(html_file)
puts html_doc.search("p", "h2").map(&:text)
When I do this, I get all the "p" text and then all the "h2" text. Is there a way to get them in the order that they appear in the original text?
I tried something like this below, but it doesn't quite work:
puts html_doc.search("p" || "h2").map(&:text)
Sorry, found my own answer.
puts html_doc.search("p, h2").map(&:text)
I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv", "w", {:force_quotes => true}) do |out|
  out << headers
  Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
    gdsn_doc = GDSNDoc.new(xml_file)
    logger.info("Processing xml file #{xml_file}")

    @desc_exists = @gdsn_doc.xpath("//productData/description")
    if !@desc_exists.empty?
      row = []
      headers.each do |col|
        row << product[col]
      end
      out << row
    end
  end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
</productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri is initialised. It no longer raises runtime errors, but it still lets XML files with blank descriptions through, and it shouldn't:
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
desc_exists = @doc.xpath("//productData/descriptions")
if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
and then using it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
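A quick illustration of why the @ version fails (a sketch; the names mirror the question):
gdsn_doc = "a local variable"  # assigned above
@gdsn_doc                      # => nil - this instance variable was never assigned
@gdsn_doc.xpath("//x")         # => NoMethodError: undefined method `xpath' for nil:NilClass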
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
</productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)".
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading the file and then passing its content to something else for processing, hence the use of File.read(...), which slurps the file. (Slurping isn't necessarily good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK, because DOM-based parsing is easier to use than SAX.)
If Nokogiri doesn't raise an exception, it was able to parse the content; however, that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
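For example, feeding Nokogiri a malformed variant of the sample above (a quick sketch; the exact messages depend on your libxml2 version):
require 'nokogiri'

doc = Nokogiri::XML('<productData><description>Chocolate biscuits </description><productData>')
doc.errors.each { |e| puts e }  # lists the fix-ups libXML applied during parsing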
Nokogiri includes a command-line version nokogiri that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.
I know that I can parse and render an HTML document with Kramdown in ruby using something like
require 'kramdown'
s = 'This is a _document_'
Kramdown::Document.new(s).to_html
# '<p>This is a <em>document</em></p>'
In this case, the string s may contain a full document in markdown syntax.
What I want to do, however, is to parse s assuming that it only contains span-level markdown syntax, and obtain the rendered HTML. In particular, there should be no <p>, <blockquote>, or, e.g., <table> in the rendered HTML.
s = 'This is **only** a span-level string'
# .. ??? ...
# 'This is <strong>only</strong> a span-level string'
How can I do this?
I would post-process the output with the sanitize gem.
require 'kramdown'
require 'sanitize'

html = Kramdown::Document.new(s).to_html
output = Sanitize.fragment(html, elements: ['b', 'i', 'em', 'strong'])
The elements are a whitelist of allowed tags; just add all the tags you want. The gem has a set of predefined whitelists, but none match exactly what you're looking for. (BTW, if you want a list of all the HTML5 elements allowed in a span, see the WHATWG's list of "phrasing content".)
I know this wasn't tagged rails, but for the benefit of readers using Rails: use the built-in sanitize helper.
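For example, in a Rails view or helper (a sketch assuming Rails 5+; the tag list is illustrative):
sanitize(Kramdown::Document.new(s).to_html, tags: %w[strong em a], attributes: %w[href])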
You can create a custom parser, and empty its internal list of block-level parsers.
class Kramdown::Parser::SpanKramdown < Kramdown::Parser::Kramdown
  def initialize(source, options)
    super
    @block_parsers = []  # drop all block-level parsers, keeping only span-level ones
  end
end
Then you can use it like this:
text = Kramdown::Document.new(text, :input => 'SpanKramdown').to_html
This should do what you want "the right way".
I want to edit a node of each item of an RSS feed in Ruby with Nokogiri and XPath.
I can get the value of this node, but I can't edit it:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::XML(URI.open("http://www.pcinpact.com/rss/news.xml"))
doc.xpath('//item').each do |i|
  pp i.xpath('title').first.text
end
This gets me the value of the title node in each item node. I want to edit that content, but I can't work out how to do it with XPath.
Ultimately I want to get my original XML back with the modifications.
Any idea?
To set the content, use the content= method:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::XML(URI.open("http://www.pcinpact.com/rss/news.xml"))
doc.xpath('//item').each do |i|
  pp i.xpath('title').first.content = "My new title"
end
For more on how to manipulate a document in Nokogiri, check out "Modifying an HTML / XML Document".
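To get the original XML back with the modifications applied, serialize the document when you're done (a quick sketch; the output filename is illustrative):
File.write('news_modified.xml', doc.to_xml)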
I am struggling with mechanize. I wish to "click" on a set of links which can only be identified by their position (all links within div#content) or their href.
I have tried both of these identification methods above without success.
From the documentation, I could not figure out how to return a collection of links (for clicking) based on their position in the DOM rather than by attributes directly on the link.
Secondly, the documentation suggests you can use :href to match a partial href,
page = agent.get('http://foo.com/').links_with(:href => "/something")
but the only way I can get it to return a link is by passing a fully qualified URL, e.g.:
page = agent.get('http://foo.com/').links_with(:href => "http://foo.com/something/a")
This is not very useful if I want to return a collection of links with hrefs like:
http://foo.com/something/a
http://foo.com/something/b
http://foo.com/something/c
etc...
Am I doing something wrong? Do I have unrealistic expectations?
Part II
The value you pass to :href has to be an exact match by default. So the :href in your example would only match links whose href is exactly "/something", and not "http://foo.com/something/a".
What you want to do is to pass in a regex so that it will match a substring within the href field. Like so:
page = agent.get('http://foo.com/').links_with(:href => %r{/something/})
edit:
Part I
In order to get it to select links only inside div#content, add a Nokogiri-style search method into your chain, like this:
page = agent.get('http://foo.com/').search("div#content").links_with(:href => %r{/something/}) # **
OK, that doesn't work, because after you do page = agent.get('http://foo.com/').search("div#content") you get a Nokogiri object back instead of a Mechanize one, so links_with won't work. However, you can extract the links from the Nokogiri object using the css method. I would suggest something like:
page = agent.get('http://foo.com/').search("div#content").css("a")
If that doesn't work, I'd suggest checking out http://nokogiri.org/tutorials
The nth link:
page.links[n-1]
The first 5 links:
page.links[0..4]
Links with 'something' in the href:
page.links_with :href => /something/
You can get Mechanize links from Nokogiri nodes. See the source code of the links() method:
# File lib/mechanize/page.rb, line 352
def links
  @links ||= %w{ a area }.map do |tag|
    search(tag).map do |node|
      Link.new(node, @mech, self)
    end
  end.flatten
end
So that means:
the_links = page.search("valid_selector").map do |node|
  Mechanize::Page::Link.new(node, agent, page)
end
This will give you the useful href, text and uri methods.
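For example, once the links are built this way (a quick sketch):
the_links.each do |link|
  puts link.href  # the href attribute
  puts link.text  # the link text
  # link.click    # follows the link with the agent
end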