I am parsing the XML using Nokogiri.
I am able to retrieve the stylesheets. But not the attributes of each stylesheet.
1.9.2p320 :112 >style = xml.xpath('//processing-instruction("xml-stylesheet")').first
=> #<Nokogiri::XML::ProcessingInstruction:0x5459b2e name="xml-stylesheet">
style.name
=> "xml-stylesheet"
style.content
=> "type=\"text/xsl\" href=\"CDA.xsl\""
Is there any easy way to get the type, href attributes values?
OR
Only way is to parse the content(style.content) of the processing instruction ?
I solved this problem by following instruction in below answer.
Can Nokogiri search for "?xml-stylesheet" tags?
Added new to_element method to Nokogiri::XML::ProcessingInstruction class
class Nokogiri::XML::ProcessingInstruction
def to_element
document.parse("<#{name} #{content}/>")
end
end
style = xml.xpath('//processing-instruction("xml-stylesheet")').first
element = style.to_element
To retrieve the href attribute value
element.attribute('href').value
Cannot you do that?
style.content.attribute['type'] # or attr['type'] I am not sure
style.content.attribute['href'] # or attr['href'] I am not sure
Check this question How to access attributes using Nokogiri .
Related
I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and build a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html-file as a string
# xpath_table - specifies the html table as xpath which hold the data to be extracted
def basic_table(html, xpath_table)
xpath_headers = "#{xpath_table}/thead/tr/th"
html_doc = Nokogiri::HTML(html)
html_doc = Nokogiri::HTML(html)
row_headers = html_doc.xpath(xpath_headers)
row_headers = row_headers.map do |column|
column.inner_text
end
row_contents = Array.new
table_rows = html_doc.xpath('#{xpath_table}/tbody/tr')
table_rows.each do |table_row|
cells = table_row.xpath('td')
cells = cells.map do |cell|
cell.inner_text
end
row_content_hash = Hash.new
cells.each_with_index do |cell_string, column_index|
row_content_hash[row_headers[column_index]] = cell_string
end
row_contents << [row_content_hash]
end
return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[#id="grid"]/table[#id="displayGrid"]'
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.
I'm trying to extract a link from an element (.jobtitle a) using mechanize. I'm trying to do that in the link variable below. Anyone know how?
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://id.indeed.com/')
indeed_form = page.form('jobsearch')
indeed_form.q = ''
indeed_form.l = 'Indonesia'
page = agent.submit(indeed_form)
page.search(".row , .jobtitle a").each do |job|
job_title = job.search(".jobtitle a").map(&:text).map(&:strip)
company = job.search(".company span").map(&:text).map(&:strip)
date = job.search(".date").map(&:text).map(&:strip)
location = job.search(".location span").map(&:text).map(&:strip)
summary = job.search(".summary").map(&:text).map(&:strip)
link = job.search(".jobtitle a").map(&:text).map(&:strip)
end
I don't think you can select attributes with css paths.
From the mechanize documentation:
search()
Search for paths in the page using Nokogiri's search. The paths can be XPath or CSS and an optional Hash of namespaces may be appended.
See Nokogiri::XML::Node#search for further details.
You should check out XPaths instead. See e.g.:
Getting attribute using XPath
http://www.w3schools.com/xpath/
You may need to rewrite the way you iterate through the page.
I am trying to convert a random string (which is build in XML format) in to an xml, so I can apply the "to_hash" function to it.
This is what I have:
model = live_requests[3]
parser = XML::Parser.string(model)
model_xml = parser.parse
puts model.to_hash
Now why am I getting an error when 'model_xml' should be an XML file?
I am using LibXML by the way.
http://libxml.rubyforge.org/rdoc/index.html
Libxml does not support the to_hash method. If you are looking for a way to do this that doesn't require traversing XML nodes and bulding the hash manually you should take a look at Nori.
Nori.parse("<tag>This is the contents</tag>")
# => { 'tag' => 'This is the contents' }
If you want to learn how to traverse Libxml's node trees take a look at the answer to this question.
I tried meta_search, but after adding "include MetaSearch::Searches::ActiveRecord" into my model, it raised an error as "undefined method `joins_values'" when run "MyModel.search(params[:search])"
I think I dont need full text, so I think following gems are not suitable for my project now::
mongoid_fulltext
mongoid-sphinx
sunspot_mongoid
mongoid_search
I tried a old gem named scoped-search
I can make it work for example:
get :search do
#search = Notification.scoped_search(params[:search]
search_scope = #search.scoped
defaul_scope = current_user.notifications
result_scope = search_scope.merge defaul_scope
#notifications = result_scope
render 'notifications/search'
end
but it will be allow to call any scopes in my model.
Is there any "best practice" for doing this job ?
If you want limit the scope you want use on your scoped_search you can filter your params[:search] like :
def limit_scope_search
params[:search].select{|k,v| [:my_scope, :other_scope_authorized].include?(k) }
end
I try to access a form using mechanize (Ruby).
On my form I have a gorup of Radiobuttons.
So I want to check one of them.
I wrote:
target_form = (page/:form).find{ |elem| elem['id'] == 'formid'}
target_form.radiobutton_with(:name => "radiobuttonname")[2].check
In this line I want to check the radiobutton with the value of 2.
But in this line, I get an error:
: undefined method `radiobutton_with' for #<Nokogiri::XML::Element:0x9b86ea> (NoMethodError)
The problem occured because using a Mechanize page as a Nokogiri document (by calling the / method, or search, or xpath, etc.) returns Nokogiri elements, not Mechanize elements with their special methods.
As noted in the comments, you can be sure to get a Mechanize::Form by using the form_with method to find your form instead.
Sometimes, however, you can find the element you want with Nokogiri but not with Mechanize. For example, consider a page with a <select> element that is not inside a <form>. Since there is no form, you can't use the Mechanize field_with method to find the select and get a Mechanize::Form::SelectList instance.
If you have a Nokogiri element and you want the Mechanize equivalent, you can create it by passing the Nokogiri element to the constructor. For example:
sel = Mechanize::Form::SelectList.new( page.at_xpath('//select[#name="city"]') )
In your case where you had a Nokogiri::XML::Element and wanted a Mechanize::Form:
# Find the xml element
target_form = (page/:form).find{ |elem| elem['id'] == 'formid'}
target_form = Mechanize::Form.new( target_form )
P.S. The first line above is more simply achieved by target_form = page.at_css('#formid').