Any string to XML in Ruby - ruby

I am trying to convert an arbitrary string (which is built in XML format) into an XML document, so I can apply the "to_hash" function to it.
This is what I have:
model = live_requests[3]
parser = XML::Parser.string(model)
model_xml = parser.parse
puts model.to_hash
Now why am I getting an error when 'model_xml' should be an XML file?
I am using LibXML by the way.
http://libxml.rubyforge.org/rdoc/index.html

LibXML does not support the to_hash method. If you are looking for a way to do this that doesn't require traversing XML nodes and building the hash manually, you should take a look at Nori.
# Nori 2.x exposes parse on an instance; the 1.x releases also accepted Nori.parse(...)
Nori.new.parse("<tag>This is the contents</tag>")
# => { 'tag' => 'This is the contents' }
If you want to learn how to traverse Libxml's node trees take a look at the answer to this question.

Related

How do I create a child element within a Nokogiri node?

I’m using Rails 4.2.7 with Nokogiri. I’m having trouble creating a child node. I have the following code
general = doc.xpath("//lomimscc:general")
description = Nokogiri::XML::Node.new "lomimscc:description", doc
string = Nokogiri::XML::Node.new "lomimscc:string", doc
string.content = scenario.abstract
string['language'] = 'en'
description << string
general << description
I want the “description” element to be a child element of the “general” element (and similarly I want the “string” element to be a child of the “description” element). However what is happening is that the description element is appearing as a sibling of the general element. How do I make the element appear as a child instead of a sibling?
The tutorials show how to do this in "Creating new nodes", but the simple example is:
require 'nokogiri'
doc = Nokogiri::XML('<root/>')
doc.at('root').add_child('<foo/>')
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo/>\n</root>\n"
Nokogiri makes it easy to build nodes using a string that contains the markup or nodes you want to add.
You should be able to build upon this easily.
This is also noted throughout the Node documentation any place you see "node_or_tags".
When I changed
general = doc.xpath("//lomimscc:general")
to
general = doc.xpath("//lomimscc:general").first
then everything worked as far as creating child nodes.

Specific Values in Json Parse

I am having difficulty getting to specific values when I parse a JSON file in Ruby. My JSON is based off of this link https://www.mcdonalds.com/services/mcd/us/restaurantLocator?latitude=40.7217861&longitude=-74.00944709999999&radius=8045&maxResults=100&country=us&language=en-us
No matter what I try I cannot pull the values I want, which is the "addressLine1" field. I get the following error:
`[]': no implicit conversion of String into Integer (TypeError)
Code
require 'json'
file = File.read('MCD.json')
data_hash = JSON.parse(file)
print data_hash.keys
print "\n"
print data_hash['features']['addressLine1']
data_hash['features'] is an array, so it has to be indexed with an integer rather than a string. Depending on what you actually need, you might either iterate over it, or call:
data_hash['features'].first['properties']['addressLine1']
Note 'properties' there, since addressLine1 is not a direct descendant of 'features' elements.
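If you want every address rather than just the first one, iterating could look like the sketch below. The sample data is made up and only mirrors the features/properties/addressLine1 nesting described above; the real response has many more fields.

```ruby
require 'json'

# Hypothetical sample with the same nesting as the restaurant locator response
sample = <<~JSON_TEXT
  {
    "features": [
      { "properties": { "addressLine1": "1 Example St" } },
      { "properties": { "addressLine1": "2 Sample Ave" } }
    ]
  }
JSON_TEXT

data_hash = JSON.parse(sample)

# dig returns nil instead of raising when a feature has no 'properties' key
addresses = data_hash['features'].map do |feature|
  feature.dig('properties', 'addressLine1')
end

puts addresses
```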

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# Extracts a table's rows as an array of hashes (column_name => cell content).
# html - the HTML file as a string
# xpath_table - XPath of the HTML table which holds the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)
  # The header cells supply the hash keys
  row_headers = html_doc.xpath(xpath_headers).map(&:inner_text)
  row_contents = []
  # Double quotes are needed here so that xpath_table is interpolated
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td').map(&:inner_text)
    row_content_hash = {}
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    # Append the hash itself so the result is an array of hashes
    row_contents << row_content_hash
  end
  row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'/>
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if they need a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was hoping that someone might be able to tell me how they would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.
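For what it's worth, the lookup from a recipe entry to an extraction function described above might be sketched as a plain registry. Everything in this sketch is an assumption made for illustration: the EXTRACTORS constant, the run_recipe helper, and the recipe being given as Ruby hashes instead of a recipe file. The lambda is only a stand-in so the example runs without Nokogiri; a real entry would call basic_table(html, xpath).

```ruby
# Whitelist of extraction functions that recipes may refer to by name.
# The lambda is a stand-in; a real entry would call basic_table(html, xpath).
EXTRACTORS = {
  'basic_table' => lambda do |html, xpath|
    { extractor: 'basic_table', xpath: xpath, bytes: html.bytesize }
  end
}.freeze

# Runs every step of a recipe against the page and collects the results.
def run_recipe(recipe, html)
  recipe.map do |step|
    extractor = EXTRACTORS.fetch(step.fetch(:use)) do |name|
      raise ArgumentError, "unknown extractor: #{name}"
    end
    extractor.call(html, step.fetch(:xpath))
  end
end

recipe = [
  { use: 'basic_table', xpath: '//div[@id="grid"]/table[@id="displayGrid"]' }
]

results = run_recipe(recipe, '<html></html>')
p results
```

Looking names up in a fixed hash, rather than calling send with whatever the recipe contains, keeps user-written recipes from invoking arbitrary methods.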

Ruby RDF query - extracting simple data from Seq and Bag items

I am receiving XML-serialised RDF (as part of XMP media descriptions, in case that is relevant), and processing it in Ruby. I am trying to work with the rdf gem, although I am happy to look at other solutions.
I have managed to load and query the most basic data, but am stuck when trying to build a query for items which contain sequences and bags.
Example XML RDF:
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:date>
<rdf:Seq>
<rdf:li>2013-04-08</rdf:li>
</rdf:Seq>
</dc:date>
</rdf:Description>
</rdf:RDF>
My best attempt at putting together a query:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new( :subject => { RDF::DC11.date => :date } )
results = date_query.execute(graph)
results.map { |result| { result.subject.to_s => result.date.inspect } }
=> [{"test.rdf"=>"#<RDF::Node:0x3fc186b3eef8(_:g70100421177080)>"}]
I get the impression that my results at this stage ("query solutions"?) are a reference to the rdf:Seq container. But I am lost as to how to progress. For the example above, I'd expect to end up, eventually, with an array ["2013-04-08"].
When there is incoming data without the rdf:Seq and rdf:li containers, I am able to extract the strings I want using RDF::Query, following examples at http://rdf.rubyforge.org/RDF/Query.html - unfortunately I cannot find any examples of more complex queries or RDF structures processed in Ruby.
Edit: In addition, when I try to find appropriate methods to use with the RDF::Node object, I cannot see any way to explore any further relations it may have:
results[0].date.methods - Object.methods
=> [:original, :original=, :id, :id=, :node?, :anonymous?, :unlabeled?, :labeled?, :to_sym, :resource?, :constant?, :variable?, :between?, :graph?, :literal?, :statement?, :iri?, :uri?, :valid?, :invalid?, :validate!, :validate, :to_rdf, :inspect!, :type_error, :to_ntriples]
# None of the above leads AFAICS to more data in the graph
I know how to get the same data in xpath (well, at least provided we always get the same paths in the serialisation), but feel it is not the best query language to use in this case (it's my backup plan, however, if it turns out too complex to implement an RDF-query solution)
I think you're correct when you say that "my results at this stage ("query solutions"?) are a reference to the rdf:Seq container". RDF/XML is a really horrible serialisation format; think of the data as a graph instead. Here is a picture of an rdf:Bag. An rdf:Seq works the same way, and the #students in the example is analogous to the #date in your case.
So to get to the date literal, you need to hop one node further in the graph. I'm not familiar with the syntax of this Ruby library, but something like:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new({
:yourThing => {
RDF::DC11.date => :dateSeq
},
:dateSeq => {
RDF.type => RDF.Seq,
RDF._1 => :dateLiteral
}
})
date_query.execute(graph).each do |solution|
puts "date=#{solution.dateLiteral}"
end
Of course, if you expect the Seq to actually contain multiple dates (otherwise it wouldn't make sense to have a Seq), you will have to match them with RDF._1 => :dateLiteral1, RDF._2 => :dateLiteral2, RDF._3 => :dateLiteral3 etc.
Or for a more generic solution, match all the properties and objects on the dateSeq with:
:dateSeq => {
:property => :dateLiteral
}
and then filter out the case where :property ends up being rdf:type, because :dateLiteral is then not a date but rdf:Seq. Maybe the library also has a special method to get all of a Seq's contents.

How to retrieve the nokogiri processing instruction attributes?

I am parsing the XML using Nokogiri.
I am able to retrieve the stylesheets. But not the attributes of each stylesheet.
1.9.2p320 :112 >style = xml.xpath('//processing-instruction("xml-stylesheet")').first
=> #<Nokogiri::XML::ProcessingInstruction:0x5459b2e name="xml-stylesheet">
style.name
=> "xml-stylesheet"
style.content
=> "type=\"text/xsl\" href=\"CDA.xsl\""
Is there an easy way to get the values of the type and href attributes, or is the only way to parse the content (style.content) of the processing instruction?
I solved this problem by following the instructions in the answer to this question:
Can Nokogiri search for "?xml-stylesheet" tags?
I added a new to_element method to the Nokogiri::XML::ProcessingInstruction class:
class Nokogiri::XML::ProcessingInstruction
def to_element
document.parse("<#{name} #{content}/>")
end
end
style = xml.xpath('//processing-instruction("xml-stylesheet")').first
element = style.to_element
To retrieve the href attribute value
element.attribute('href').value
Can't you do this?
style.content.attribute['type'] # or attr['type'] I am not sure
style.content.attribute['href'] # or attr['href'] I am not sure
Check this question: How to access attributes using Nokogiri.
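If re-parsing the instruction as an element feels heavy, the content string itself can be split into its pseudo-attributes. A minimal sketch, assuming the content always has the name="value" shape shown in the question; pseudo_attributes is a hypothetical helper, not part of Nokogiri:

```ruby
# The string returned by style.content in the session above
content = 'type="text/xsl" href="CDA.xsl"'

# Collects name="value" or name='value' pairs into a hash.
def pseudo_attributes(content)
  pairs = content.scan(/([\w:.-]+)\s*=\s*(["'])(.*?)\2/)
  pairs.each_with_object({}) do |(name, _quote, value), attrs|
    attrs[name] = value
  end
end

attrs = pseudo_attributes(content)
puts attrs['type']
puts attrs['href']
```

This does not decode entities or character references inside the values, which is one reason the to_element approach may still be preferable for untrusted input.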
