I have XML with additional data in comments at the start of the XML file.
I want to read these details for some logging.
Is there any way to read XML commented values in Ruby?
The commented details are:
<!-- sessionId="QQQQQQQQ" --><!-- ProgramId ="EP445522" -->
Extracting values using Nokogiri and XPath:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML('<!-- sessionId="QQQQQQQQ" --><!-- ProgramId ="EP445522" -->')
comments = doc.xpath('//comment()')
tags = Hash[*comments.map { |c| c.content.match(/(\S+)\s*="(\w+)"/).captures }.flatten]
puts tags.inspect
# => {"sessionId"=>"QQQQQQQQ", "ProgramId"=>"EP445522"}
You could do something like this:
require 'rexml/document'
require 'rexml/xpath'
File::open('q.xml') do |fd|
xml = REXML::Document::new(fd)
REXML::XPath::each(xml.root, '//comment()') do |comment|
case comment.string
when /sessionId/
puts comment.string
when /ProgramId/
puts comment.string
end
end
end
Effectively what this does is loop through all comment nodes, then look for the strings you're interested in, such as sessionId. Once you've found the nodes you're looking for, you can process them using Ruby to extract the information you need.
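As a rough sketch of that last processing step, the comment text can be turned into a hash with a regular expression. This assumes the comments sit at the top level of the document, before the root element, as described in the question; the inline sample and its `<root/>` element are stand-ins for the real file.

```ruby
require 'rexml/document'

# Inline sample standing in for the real file; <root/> is a placeholder element.
xml = REXML::Document.new(<<~XML)
  <!-- sessionId="QQQQQQQQ" --><!-- ProgramId ="EP445522" -->
  <root/>
XML

# Walk the document's top-level nodes and keep key="value" pairs
# found inside comment nodes.
details = {}
xml.children.each do |node|
  next unless node.is_a?(REXML::Comment)

  match = node.string.match(/(\w+)\s*=\s*"([^"]*)"/)
  details[match[1]] = match[2] if match
end

p details
```

After this runs, `details['sessionId']` and `details['ProgramId']` hold the two values for logging.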
I have an XML document:
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
My code is:
require 'nokogiri'
require 'selwet'
context "parse xml" do
  doc = Nokogiri::XML(File.open("test.xml"))
  doc.xpath("cred/login").each do |char_element|
    puts char_element.text
  end
  should "check" do
    Unit.go_to "http://www.ya.ru/"
    Unit.click '.b-inline'
    Unit.fill '[name="login"]', @login
  end
end
When I run my test I get:
Tove
0
But I want to insert the parse result into @login. How can I get variables with the parsing result? Do I need to insert the login and pass values from the XML into fields on the web page?
You can get the value of login from your XML with:
@login = doc.xpath('//cred/login').text
I'd use something like this to get the values:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
EOT
login = doc.at('login').text # => "Tove"
pass = doc.at('pass').text # => "Jani"
Nokogiri makes it really easy to access values using CSS, so use it for readability when possible. The same thing can be done using XPath:
login = doc.at('//login').text # => "Tove"
pass = doc.at('//pass').text # => "Jani"
but having to add // twice to accomplish the same thing is usually wasted effort.
The important part is at, which returns the first occurrence of the target. at allows us to use either CSS or XPath, but CSS is usually less visually noisy.
I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method `xpath' for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing. I've used Watir, though, and thought this would work, but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
# On Ruby 3.0 and later, open-uri needs URI.open; plain open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.
Is it possible to do multi-domain searches using Nokogiri? I am aware you can do multiple XPath/CSS searches against a single domain/page, but what about multiple domains?
For example I want to scrape http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications and http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
My Code
require 'nokogiri'
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
#Grab our product specifications
data = doc.css('div#specifications div#spec-area ul.product-spec li')
#Modify our data
lines = data.map(&:text)
#Create the Spreadsheet
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
#Output our data to the Spreadsheet
lines.each.with_index do |line, i|
sheet1[i, 0] = line
end
book.write 'C:/Users/Barry/Desktop/output.xls'
Nokogiri has no concept of URLs; it only knows about a String or IO stream of XML or HTML. You're confusing OpenURI's purpose with Nokogiri's.
If you want to read from multiple sites, simply loop over the URLs, and pass the current URL to OpenURI to open the page:
%w[
http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications
http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
].each do |url|
doc = Nokogiri::HTML(URI.open(url)) # URI.open is required on Ruby 3.0+
# do something with the document...
end
OpenURI will read the page, and pass its contents to Nokogiri for parsing. Nokogiri will still only see one page at a time, because that's all it is passed by OpenURI.
My task
Extract all specifications from http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications and put them in a spreadsheet (we'll work on formatting later).
Problem
Spreadsheet is created but my output is returning blank.
My Code
require 'Nokogiri'
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
data = puts doc.css('//div#specifications/div#spec-area/ul#product-spec/li')
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
sheet1[0,0] = data
book.write 'C:/Users/Barry/Desktop/output.xls'
The following code worked for me
require 'nokogiri' # lowercase, so it also loads on case-sensitive file systems
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(URI.open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
data = doc.css('div#specifications div#spec-area ul.product-spec')[0].text
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
sheet1[0,0] = data
book.write 'C:/Users/Barry/Desktop/output.xls'
There are a few problems here:
It looks like you’re trying to debug by printing out the result of the css call in the line:
data = puts doc.css('//div#specifications/div#spec-area/ul#product-spec/li')
The method puts returns nil, so data will be nil and will result in nothing being shown.
In the page you’re parsing, the product-spec list is in fact a class, not an id, so you need .product-spec (. instead of #).
The syntax you’re using isn’t actually CSS; it looks like you’re mixing CSS and XPath. You want something like this:
doc.css('div#specifications div#spec-area ul.product-spec li')
(This last point doesn’t seem to actually affect the result. Nokogiri converts CSS selectors to XPath, and it appears that the transformation results in valid XPath anyway.)
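The first point is easy to check without any HTML at all. In this small sketch the array is just a stand-in for whatever the selector returns, and its values are made up:

```ruby
# Stand-in for the node texts a selector would return (made-up values).
lines = ['first spec', 'second spec']

# puts prints its argument but returns nil, so nothing is kept.
data_from_puts = puts(lines)

# p prints the inspected value and returns it, so the data survives.
data_from_p = p(lines)

# Assigning first and printing afterwards also keeps the data.
data = lines
puts data
```

Either of the last two forms lets you keep debugging output while still having something to write into the spreadsheet cell.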
I don't know what name this goes by and that's been complicating my search.
My data file OX.session.xml is in the (old?) form
<?xml version="1.0" encoding="utf-8"?>
<CAppLogin xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://oxbranch.optionsxpress.com">
<SessionID>FE5E27A056944FBFBEF047F2B99E0BF6</SessionID>
<AccountNum>8228-5500</AccountNum>
<AccountID>967454</AccountID>
</CAppLogin>
What is that XML data format called exactly?
Anyway, all I want is to end up with one hash in my Ruby code like so:
CAppLogin = { :SessionID => "FE5E27A056944FBFBEF047F2B99E0BF6", :AccountNum => "8228-5500", etc. } # Doesn't have to be called CAppLogin as in the file, may be fixed
What might be the shortest, most built-in Ruby way to automate reading that hash, so that I can update the SessionID value and easily store it back in the file for later program runs?
I've played around with YAML, REXML but would rather not yet print my (bad) example trials.
There are a few libraries you can use in Ruby to do this.
Ruby toolbox has some good coverage of a few of them:
https://www.ruby-toolbox.com/categories/xml_mapping
I use XmlSimple: require the gem, then load your XML file using xml_in:
require 'xmlsimple'
hash = XmlSimple.xml_in('session.xml')
If you're in a Rails environment, you can just use Active Support:
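require 'active_support'
require 'active_support/core_ext/hash/conversions'
# Hash.from_xml expects the XML text itself, not a file name
session = Hash.from_xml(File.read('session.xml'))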
Using Nokogiri to parse the XML with namespaces:
require 'nokogiri'
dom = Nokogiri::XML(File.read('OX.session.xml'))
node = dom.xpath('ox:CAppLogin',
'ox' => "http://oxbranch.optionsxpress.com").first
hash = node.element_children.each_with_object(Hash.new) do |e, h|
h[e.name.to_sym] = e.content
end
puts hash.inspect
# {:SessionID=>"FE5E27A056944FBFBEF047F2B99E0BF6",
# :AccountNum=>"8228-5500", :AccountID=>"967454"}
If you know that the CAppLogin is the root element, you can simplify a bit:
require 'nokogiri'
dom = Nokogiri::XML(File.read('OX.session.xml'))
hash = dom.root.element_children.each_with_object(Hash.new) do |e, h|
h[e.name.to_sym] = e.content
end
puts hash.inspect
# {:SessionID=>"FE5E27A056944FBFBEF047F2B99E0BF6",
# :AccountNum=>"8228-5500", :AccountID=>"967454"}
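The question also asks about updating SessionID and storing it back, which the snippets above don't cover. Here is one possible sketch of that round trip. It swaps in REXML (mentioned in the question) instead of Nokogiri, works on an inline copy of the file so it runs on its own, and uses a made-up placeholder for the new session ID.

```ruby
require 'rexml/document'

# Inline copy of the file contents; in practice this would come from
# File.read('OX.session.xml').
source = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <CAppLogin xmlns="http://oxbranch.optionsxpress.com">
    <SessionID>FE5E27A056944FBFBEF047F2B99E0BF6</SessionID>
    <AccountNum>8228-5500</AccountNum>
    <AccountID>967454</AccountID>
  </CAppLogin>
XML

doc = REXML::Document.new(source)

# Read every child element of the root into a symbol-keyed hash.
session = {}
doc.root.elements.each { |e| session[e.name.to_sym] = e.text }

# Update the value in the document itself (placeholder ID, not a real one).
doc.root.elements.each do |e|
  e.text = 'NEW-SESSION-ID' if e.name == 'SessionID'
end

# Serialize the modified document back to a string.
updated = String.new
doc.write(updated)
```

Persisting the change is then a matter of `File.write('OX.session.xml', updated)`. Editing the parsed document rather than rebuilding it from the hash keeps the namespace declaration and element order as they were.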