Crystal xpath library or implementation - xpath

Is there any way to use xpath when parsing an HTML file ?
I am looking for a Ruby Nokogiri equivalent but Crystagiri does not implement it ( yet ? ). Also tried myhtml and modest but to no avail.

You don't need to use external libraries for this! Crystal has an XML module built in, which has xpath support.
Here's a basic example:
nodes = XML.parse_html(html_content)
nodes.xpath_nodes(query).each do |node|
# do something
end
where html_content is your HTML as a string, and query is your xpath query.

found one : hq from maiha
it implements xpath by wrapping the Crystal XML and myhtml and works well.
require "hq"
node = Hq.parse("<html><body><div>82 users</div></body></html>")
node.xpath("/html/body/div").text # => "82 users"
node.xpath("/html/body/div").int # => 82

Related

Removing XML tags when parsing XML

Using Ruby with Nokogiri is there an easy way to remove tags around returned results? I can't find one in the docs.
Example from the Nokogiri site:
characters[0].to_s # => "<character>Al Bundy</character>"
I was hoping to get:
Al Bundy
Try using the text method:
characters[0].text
You can use the .inner_html method. Here is an example you can use from a basic xml sitemap:
parse_content.css("url").each do |x|
location = x.css("loc").inner_html
last_mod = x.css("lastmod").inner_html
end
You can read about sitemaps here: https://www.sitemaps.org/protocol.html

How to get HTML of an element when using Poltergeist?

I'm using Capybara with the Poltergeist driver. My question is: how to get the HTML (string) of a node?
I've read that using the RackTest driver you can get it like this:
find("table").native #=> native Nokogiri element
find("table").native.to_html #=> "..."
But with Poltergeist calling #native on a node returns a Capybara::Poltergeist::Node, not a native Nokogiri element. And then calling #native again on the Capybara::Poltergeist::Node returns the same Capybara::Poltergeist::Node again (that is, it returns self).
It has become slightly irritating having to look at the HTML from the entire page to find what I'm looking for :P
I am adding this answer for others who land here. The solution is dead simple.
following the example you provided it would be:
find("table")['outerHTML']
I also find Poltergeist irritating. Here's what I did:
def nokogiri(selector)
nokogiri = Nokogiri::HTML(page.html);
return nokogiri.css(selector)[0]
end
This takes a css selector, and returns a native nokogiri element, rather than poltergeist's idiocy. You'll also have to require 'nokogiri', but it shouldn't be a problem since it's a dependency for poltergeist.
Its can be done like this
lets say on google.co.in you wana fetch INDIA
on step.rb file under your function write this line
x = page.find(:xpath,'//*[#id="hplogo"]/div' , :visible => false).text
puts x
x will display "India"
Terminal o/p

How do I get the input value from a Nokogiri::XML::NodeSet?

I am looking for my input element using Nokogiri's xpath method.
It's returning an object of class Nokogiri::XML::NodeSet:
[#<Nokogiri::XML::Element:0x3fcc0e07de14 name="input" attributes=[#<Nokogiri::XML::Attr:0x3fcc0e07dba8 name="type" value="text">, #<Nokogiri::XML::Attr:0x3fcc0e07db94 name="name" value="creditInstallmentAmount">, #<Nokogiri::XML::Attr:0x3fcc0e07db44 name="style" value="width:240px">, #<Nokogiri::XML::Attr:0x3fcc0e07dae0 name="value" value="94.8">, #<Nokogiri::XML::Attr:0x3fcc0e07da18 name="readonly" value="true">]>
Is there a faster and cleaner way to get the value of input than casting this using to_s:
"<input type=\"text\" name=\"creditInstallmentAmount\" style=\"width:240px\" value=\"94.8\" readonly>"
and match with regular expressions?
A couple things will help:
Nokogiri has the at method, which is the equivalent of search(...).first, and, instead of returning a NodeSet, it returns the Node itself, making it easy to grab values from it:
require 'nokogiri'
doc = Nokogiri::HTML('<input type="text" name="creditInstallmentAmount" style="width:240px" value="94.8" readonly>')
doc.at('input')['value'] # => "94.8"
doc.at('input')['value'].to_f # => 94.8
Also, notice I'm using CSS notation, instead of XPath. Nokogiri supports both, and a lot of times the CSS is more obvious and easily readable. The at_css method is an alias to at for convenience.
Note that Nokogiri uses a little test in search and at to try to determine whether the selector is CSS or XPath, and then branches accordingly to the specific method. The test can be fooled, at which point you should use the specific CSS or XPath variant, or always use them if you're paranoid. In years of using Nokogiri I've only once encountered the situation where the code was confused.
If you want to be more explicit about which input you want, you can look into the parameters for the tag:
doc.at('input[#name="creditInstallmentAmount"]')['value'] # => "94.8"
Get familiar with the difference between search and at and their varients, and Nokogiri will really become useful to you. Learn how to access the parameters and text() nodes and you'll know 99% of what you need to know for parsing HTML and XML.
Ok, I found the answer:
.map{|node| node["value"]}.first
Ok, this works for me
require 'nokogiri'
require 'open-uri'
html = open ARGV[0]
doc = Nokogiri::HTML(html)
inputs = doc.search 'input'
inputs.map{|node| node['name']}
or all in one
inputs = Nokogiri::HTML(html).search('input').map{|node| node['name']}

Extracting text from a webtable in Watir / Ruby

I'm trying to extract the data from an Income Statement, url is http://finance.yahoo.com/q/is?s=LMT+Income+Statement&annual
I was unable to find the table using the browser.table(:name, 'blah') or (:id, 'blah'), but had some luck using the xpath with Nokogiri using this code, which picks up after I've initialized everything and browsed to the page:
page_html = Nokogiri::HTML.parse(browser.html)
tobj = page_html.xpath('//*[#id="yfncsumtab"]').inner_text
Now I'm able to take tobj and pull the data out, but it doesn't do me any good for trying to manipulate the object as a table. Any suggestions on how to go about storing the table as a variable would help. I can probably figure out iterating through the rows/columns from there, but I wouldn't mind if you tacked on some code that would do that.
Do you know Watir has xpath support?
browser.element(:xpath => '//*[#id="yfncsumtab"]')
Look at it this way:
doc = Nokogiri::HTML.parse(browser.html)
table = doc.at('table#yfncsumtab')
# iterate through tr's
table.search('tr').each do |tr|
# do something with tr
end
Try browser.element(id: "yfncsumtab").text

How to build an Object from XML, modify, then write to File in Ruby

I'm having an awful time trying to use a library to parse an XML File into a hash like object, modify it, then print it back out to another XML file in Ruby. For a class I'm taking, we're supposed to use a Java JAXB like library where we convert XML into an object. We've already done SAX and DOM methods so we can't use those methods of XML de-serialization. Nokogiri helped me with both of these in Ruby.
The only problem is that besides the SIMPLE modifications I'm making to the objects, when I write to file it has drastic differences. Is there a Ruby library meant for doing just this? I've tried: ROXML, XML::Mapping, and ActiveSupport::CoreExt. The only one I can get to even run is ActiveSupport, and even then it starts putting element attributes as child elements in the output XML.
I'm willing to try out XmlSimple, but I'm curious has anyone actually had to do this before/run into the same problems? Again, I can't read in lines one at a time like SAX or build a Tree like structure like DOM, it needs to be a hash like object.
Any help is much appreciated!
You should have a look into nokogiri: http://nokogiri.org/
Then you can parse the XML like this :
xml_file = "some_path"
#xml = Nokogiri::XML(File.open xml_file)
#xml.xpath('//listing').each do |node|
style = node.search("style").text
end
With Xpath, you can perform queries in the XML :
#xml.xpath("//listing[name='John']").first(10)
OK, I got it working. After looking at ActiveSupport::CoreExt 's source code I found it just uses a gem called xml-simple. What's obnoxious is the gem, library name in the require statement, and class name are a mixture of hyphenated and non hyphenated spellings. For future reference here's what I did:
# gem install xml-simple
# ^ all lowercase, hyphenated
require 'xmlsimple'
# ^ all lowercase, not hyphenated
doc = XmlSimple.xml_in 'hw3.xml', 'KeepRoot' => true
# ^ Camel cased (it's a class), not hyphenated
# doc.class => Hash
# manipulate doc as a hash
file = File.new('HW3a.xml', 'w')
file.write("<?xml version='1.0' encoding='utf-8'?>\n")
file.write(XmlSimple.xml_out doc, 'KeepRoot' => true)
I hope this helps someone. Also make sure you pay attention to case and hyphens with this gem!!!

Resources