Nokogiri: Filling in a default value for empty table cells - ruby

I'm trying to scrape the cell values from an HTML table. Randomly, some of these cells are empty, and I can't guess which ones with any reliability.
Is there a way to have Nokogiri fill in a default value when it comes across an empty cell?
Thanks for any advice you can provide. Here's my code:
def scrape_stats
  stats = []
  (2002..2012).each do |year|
    url = "website/#{year}"
    doc = Nokogiri::HTML(open(url))
    rows = doc.at_css("body tbody").text.split(" ")
    (rows.count / 25).times do |i| # there are 25 columns per row
      stats << rows.shift(25)
    end
  end
  stats
end

It sounds like you want something like:
doc.search('td:empty').each{|n| n.content = 'default value'}

This would basically involve using the Nokogiri::XML::Node#add_child method (or the shorter version, Nokogiri::XML::Node#<<) to add a new child node containing the text you want to add to the empty cell.
See this question for an example:
How to add child nodes in NodeSet using Nokogiri

Related

How to iterate on select elements with Xpath with one exception?

I want to iterate over each selector found that contains a specific class in order to retrieve all elements within the divs. This works until it reaches one item containing an ID.
for selector in response.xpath("//div[@class='product-list-entry']"):
My best try to get around this is the following code:
for selector in response.xpath("//div[not(@id) and @class='product-list-entry']"):
Both versions lead to only retrieving 5 result sets instead of the full list.
How can I simply ignore the one with the id and iterate on all others?
This should extract the content of the specific divs (for example, the text of the div, the content of a span, and the text of a p element):
def parse(self, response):
    for selector in response.xpath("//div[@id='product-list']"):
        content = selector.xpath(".//div[not(@id)]/text()").extract()
        content2 = selector.xpath(".//div[not(@id)]/span").extract()
        content3 = selector.xpath(".//div[not(@id)]/p/text()").extract()
        content4 = ...
        print(content, content2, content3, ...)

How do I create a child element within a Nokogiri node?

I’m using Rails 4.2.7 with Nokogiri. I’m having trouble creating a child node. I have the following code
general = doc.xpath("//lomimscc:general")
description = Nokogiri::XML::Node.new "lomimscc:description", doc
string = Nokogiri::XML::Node.new "lomimscc:string", doc
string.content = scenario.abstract
string['language'] = 'en'
description << string
general << description
I want the “description” element to be a child element of the “general” element (and similarly I want the “string” element to be a child of the “description” element). However what is happening is that the description element is appearing as a sibling of the general element. How do I make the element appear as a child instead of a sibling?
The tutorials show how to do this in "Creating new nodes", but the simple example is:
require 'nokogiri'
doc = Nokogiri::XML('<root/>')
doc.at('root').add_child('<foo/>')
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo/>\n</root>\n"
Nokogiri makes it easy to build nodes using a string that contains the markup or nodes you want to add.
You should be able to build upon this easily.
This is also noted throughout the Node documentation any place you see "node_or_tags".
When I changed
general = doc.xpath("//lomimscc:general")
to
general = doc.xpath("//lomimscc:general").first
then everything worked as far as creating child nodes.

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
# Extracts a table's rows as an array of hashes (column_name => cell content).
# html - the HTML document as a string
# xpath_table - XPath of the HTML table that holds the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath(xpath_headers).map do |column|
    column.inner_text
  end
  row_contents = []
  # Double quotes are needed here so that #{xpath_table} is interpolated.
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td').map do |cell|
      cell.inner_text
    end
    row_content_hash = {}
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    # Append the hash itself so the result is an array of hashes.
    row_contents << row_content_hash
  end
  row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

has_css? will not find a present id

I am writing a finder for pages with a varying but finite set of IDs:
@field = ['name1', 'name2']
def fieldfind
  @field.each do |elem|
    out = elem if page.has_css?(elem)
  end
end
HTML
<input type='text' id = 'name1'>
For whatever reason, I cannot find name1. I tried find_field? and elem.to_s, but to no avail. Any ideas?
As Baldrick mentioned, the CSS locator is not right. However, even after correcting that, you would still have a problem with @field.each: it returns the whole array, not the matching element or the CSS locator of the field that exists.
If you want an element that matches one of the CSS locators in @field, try:
@field = ['#name2', '#name1']
def fieldfind
  matching_css = @field.find{ |elem| page.has_css?(elem) }
  page.find(matching_css)
end
Or if you just want the matching CSS locator:
@field = ['#name2', '#name1']
def fieldfind
  @field.find{ |elem| page.has_css?(elem) }
end
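The difference between each and find can be seen without a browser session. In this sketch a lambda stands in for page.has_css? (an assumption for illustration; it pretends that only '#name1' is on the page):

```ruby
# Stand-in for page.has_css?: only '#name1' "exists" on the page.
present = ['#name1']
has_css = ->(selector) { present.include?(selector) }

fields = ['#name2', '#name1']

# each ignores the block's value and returns the receiver array.
each_result = fields.each { |selector| selector if has_css.call(selector) }

# find returns the first element for which the block is truthy.
find_result = fields.find { |selector| has_css.call(selector) }
```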
You are missing #, the CSS selector prefix for finding an element by id. It should be:
@field = ['#name1', '#name2']
...
You can use:
page.find(:css, "input[id=name1]")
If that works, go ahead and dynamically add the variable to your code.
The reason I'm suggesting you try to find an element first (rather than just improving your existing each block) is that your session object may be pointed at the wrong window or frame.

Extracting HTML5 data attributes from a tag

I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping to use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[@data-*]').size
but it seems that isn't supported.
Option 1: Grab all data attributes
If all you need is to list all of the page's data attributes, here's a one-liner:
Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "@*[starts-with(name(), 'data-')]"
# If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
  tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array of hashes, one per matching tag, each mapping attribute names to values.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
  module XML
    class Node
      def dataset
        Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
      end
    end
  end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following is the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-location"=>"London", "data-age"=>"50"},
{"data-location"=>"Oxford", "data-age"=>"40"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements that start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribute that does not start with data-.
The Node#css docs mention a way to attach a custom pseudo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
  def regex_attrs(node_set, regex)
    node_set.find_all { |node| node.attributes.keys.any? { |k| k =~ /#{regex}/ } }
  end
}.new)
