Using Nokogiri with multiple search elements - ruby

In this XML snippet I need to replace the data in the UID for some of the blocks. The actual file contains more than 100 similar blocks.
Although I have been able to extract subsets based on name="Track (Timeline)", I am struggling to reduce this subset to the specific block I need by also using the data in the <TrackID>, if name="Track (TimeLine)" and the text of <TrackID> is 0x1200 then set UID to xxxx.
I am new to Nokogiri and, although I write test scripts, I do not consider myself a programmer.
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
<EditRate>25/1</EditRate>
<Origin>0</Origin>
<Sequence>32-04-25-67-E7-A7-86-4A-9B-28-53-6F-66-74-65-6C</Sequence>
<TrackID>0x1200</TrackID>
<TrackName>Softel VBI Data</TrackName>
<TrackNumber>0x17010101</TrackNumber>
<UID>34-C1-B9-B9-5F-07-A4-4E-8F-F4-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
<EditRate>25/1</EditRate>
<Origin>0</Origin>
<Sequence>35-12-2D-86-E6-74-0B-4C-B4-24-53-6F-66-74-65-6C</Sequence>
<TrackID>0x1300</TrackID>
<TrackName>Softel VBI Data</TrackName>
<TrackNumber>0x0</TrackNumber>
<UID>37-0C-80-34-4C-8D-CE-41-85-F3-53-6F-66-74-65-6C</UID>
</StructuralMetadata>

Using xpath:
//StructuralMetadata
will select all StructuralMetadata elements in your XML. The double slash at the start means to select nodes wherever they appear in the document.
You don't want all the nodes though, you can filter the ones you want with a predicate:
//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]
This will select all StructuralMetadata elements that have a name attribute with the value Track (TimeLine), and a TrackID child element with contents 0x1200.
As you're interested in the UID element, you can further refine the expression:
//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]/UID
This expression will match all the UID elements that are children of StructuralMetadata elements that match the predicate described above.
Putting this to use:
require 'nokogiri'
# Parse the document, assuming xml_file is a File object containing the XML
doc = Nokogiri::XML(xml_file)
# I'm assuming there is only one element in the document that matches
# the criteria, so I'm using at_xpath
node = doc.at_xpath('//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]/UID')
# At this point, doc contains a representation of the xml, and node points to
# the UID node within that representation. We can update the contents of
# this node
node.content = 'XXX'
# Now write out the updated XML. This just writes it to standard output,
# you could write it to a file or elsewhere if needed
puts doc.to_xml

A great way to approach this problem is with the ‘map reduce’ style of programming, which works to take a large list of things and narrow it down and combine it into the result you're after. Specifically, Array#find and Array#select are really useful for this sort of problem. Check out this example:
require 'nokogiri'
xml = Nokogiri::XML.parse(File.read "sample.xml")
element = xml.css('StructuralMetadata').find { |item|
item['name'] == "Track (TimeLine)" and item.css('TrackID').text == "0x1200"
}
puts element.to_xml
This little program first uses the CSS selector to get all of the <StructuralMetadata> elements in the document. It returns an array, which we can filter to just what we want using the Array#find method. Array#select is its cousin which returns an array of all the matching objects instead of the first one it happens to find.
Inside the block we have a test to check if the <StructuralMetadata> tag is the one we’re after. Then it puts the element.to_xml string to the console so you can see which thing it found if you run this as a command-line script. Now you can find the element, you can modify it in the usual way and save out a new XML file or whatever.

Related

Extract a specific node from an XML file

I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts every node everything
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But does not work.
Anyone has any suggestions how to extract exactly the body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element with name wcm:element that is a child of the wcm:root element which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to is to get the values of the wcm:element element where the attribute name is set to the value Body. You access attributes in XPath by prefixing them with an # sign and to express a where condition you use [...] - a predicate. You XPath statement needs to be:
/wcm:root/wcm:element[#name = 'Body']
I'm assuming that your XPath execution environment is fine the namespace prefixes (wcm) because you say that your first query returned content.

Parsing one large array into several sub-arrays

I have a list of adjectives (found here), that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list.gsub(/\n/, " ")
The next step is to break the string up by category..
list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into the "categories" array. Now pull all the elements from the main array, and pop them into the sub-array, until you meet another tilde. Then repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
cat_name, *words = category_string.split(" ")
words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides a very useful method sample, so you can just do this
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input etc
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create an sub array for each word starting with ~, containing all words before the next matching word.
If this file format is your original and you have freedom to change it, then I recommend you save the data as yaml or json format and read it when needed. There are libraries to do this. That is all. No worry about the mess. Don't spend time reinventing the wheel.

Scraping a website with Nokogiri

I am using Nokogiri to scrape a website and am running into an issue when I try to grab a field from a table. I am using selector gadget to find the CSS selector of the table. I am grabbing data from a government website that details information on motor carriers.
The method that I am using looks like:
def scrape_database
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=#{self.dot}#Inspections"
doc = Nokogiri::HTML(open(url))
self.name = doc.at_css("tr:nth-child(4) .queryfield").text
self.address = doc.at_css("tr:nth-child(6) .queryfield").text
end
I grab all of the fields in the upper table using that syntax and the method operates fine, however I am having issues with the crash rate/inspections table below it.
Here is what I am using to grab that info:
self.vehicle_inspections = doc.at_css("center:nth-child(13) tr:nth-child(2) :nth-child(2)").text
undefined method `text' for nil:NilClass
If I remove text from the end of this, the method runs but doesn't grab any relevant information (obviously). I am assuming this is due to the complicated selector that I am using to grab the field, but am not quite sure.
Has anyone run into a similar problem and can you give me some advice?
Yes, that error means that your CSS selector is not finding the information; at_css is returning nil, and nil.text is not valid. You can guard against it like so:
insp = doc.at_css("long example css selector")
self.vehicle_inspections = insp && insp.text
However, it sounds to me like you "need" this data. Since you have not provided with the HTML page nor the CSS selectors, I can't help you craft a working CSS or XPath selector.
For future questions, or an edit to this one, note that actual (pared-down) code is strongly preferred over hand waving and loose descriptions of what your code looks like. If you show us the HTML page, or a relevant snippet, and describe which element/text/attribute you want, we can tell you how to select it.
I see six tables on that page. Which is the "crash rate/inspections" table? Given that your URL includes #Inspections on the end, I'm assuming you're talking about the two tables immediately underneath the "Inspections/Crashes In US" section. Here are XPath selectors that match each:
require 'nokogiri'
require 'open-uri'
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=800585"
doc = Nokogiri::HTML(open(url))
table1 = doc.at_xpath('//table[#summary="Inspections"][preceding::h4[.//a[#name="Inspections"]]]')
table2 = doc.at_xpath('//table[#summary="Crashes"][preceding::h4[.//a[#name="Inspections"]]]')
# Find a row by index (1 is the first row)
vehicle_inspections = table1.at_xpath('.//tr[2]/td').text.to_i
# Find a row by header text
out_of_service_drivers = table1.at_xpath('.//tr[th="Out of Service"]/td[2]').text.to_i
p [ vehicle_inspections, out_of_service_drivers ]
#=> [6, 0]
tow_crashes = table2.at_xpath('.//tr[th="Crashes"]/td[3]').text.to_i
p tow_crashes
#=> 0
The XPath queries may look intimidating. Let me explain how they work:
//table[#summary="Inspections"][preceding::h4[.//a[#name="Inspections"]]]
//table find a <table> at any level of the document
[#summary="Inspections"] …but only if it has a summary attribute with this value
[preceding::h4…] …and only if you can find an <h4> element earlier in the document
[.//a…] …specifically, a <h4> that has an <a> somewhere underneath it
[#name="Inspections"] …and that <a> has to have a name attribute with this text.
This would actually match two tables (there's another summary="Inspections" table later on the page), but using at_xpath finds the first matching table.
.//tr[2]/td
. Starting at the current node (this table)
//tr[2] …find the second <tr> that is a descendant at any level
/td …and then find the <td> children of that.
Again, because we're using at_xpath we find the first matching <td>.
.//tr[th="Out of Service"]/td[2]
. Starting at the current node (this table)
//tr …find any <tr> that is a descendant at any level
[th="Out of Service] …but only those <tr> that have a <th> child with this text
/td[2] …and then find the second <td> children of those.
In this case there is only one <tr> that matches the criteria, and thus only one <td> that matches, but we still use at_xpath so that we get that node directly instead of a NodeSet with a single element in it.
The goal here (and with any screen scraping) is to latch onto meaningful values on the page, not arbitrary indices.
For example, I could have written my table1 xpath as:
# Find the first table with this summary
table1 = doc.at_xpath('//table[#summary="Inspections"][1]')
…or even…
# Find the 20th table on the page
//table[20]
However, those are fragile. Someone adding a new section to the page, or code that happens to add or remove a formatting table would cause those expressions to break. You want to hunt for strong attributes and text that likely won't change, and anchor your searches based on that.
The vehicle_inspections XPath is similarly fragile, relying on the ordering of rows instead of the label text for the row.

Can't get nth node in Selenium

I try to write xpath expressions so that my tests won't be broken by small design changes. So instead of the expressions that Selenium IDE generates, I write my own.
Here's an issue:
//input[#name='question'][7]
This expression doesn't work at all. Input nodes named 'question' are spread across the page. They're not siblings.
I've tried using intermediate expression, but it also fails.
(//input[#name='question'])[2]
error = Error: Element (//input[#name='question'])[2] not found
That's why I suppose Seleniun has a wrong implementation of XPath.
According to XPath docs, the position predicate must filter by the position in the nodeset, so it must find the seventh input with the name 'question'. In Selenium this doesn't work. CSS selectors (:nth-of-kind) neither.
I had to write an expression that filters their common parents:
//*[contains(#class, 'question_section')][7]//input[#name='question']
Is this a Selenium specific issue, or I'm reading the specs wrong way? What can I do to make a shorter expression?
Here's an issue:
//input[#name='question'][7]
This expression doesn't work at all.
This is a FAQ.
[] has a higher priority than //.
The above expression selects every input element with #name = 'question', which is the 7th child of its parent -- and aparently the parents of input elements in the document that is not shown don't have so many input children.
Use (note the brackets):
(//input[#name='question'])[7]
This selects the 7th element input in the document that satisfies the conditions in the predicate.
Edit:
People, who know Selenium (Dave Hunt) suggest that the above expression is written in Selenium as:
xpath=(//input[#name='question'])[7]
If you want the 7th input with name attribute with a value of question in the source then try the following:
/descendant::input[#name='question'][7]

How do I use XPath in Nokogiri?

I have not found any documentation nor tutorial for that. Does anything like that exist?
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with double //? Why there is /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.
Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?
td[3]/div[1]/a/text()' <--- extracts text
How can I extract other things?
Seems you need to read a XPath Tutorial
Your //table/tbody[#id="threadbits_forum_251"]/tr expression means:
// - Anywhere in your XML document
table/tbody - take a table element with a tbody child
[#id="threadbits_forum_251"] - where id attribute are equals to "threadbits_forum_251"
tr - and take its tr elements
So, basically, you need to know:
attributes begins with #
conditions go inside [] brackets
If I correcly understood that API, you can go with doc.xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/#href if there is just one <a> element.
Your XPath is correct and you seem to have answered your own question's first part (almost):
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
"the code above will get me any table table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"
// means the following element can appear anywhere in the document.
/tr at the end means, get the tr node of the matching element.
You dont need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and get the attributes using:
theNode['href']
theNode['src']
Where theNode is your Nokogiri Node object.
Edit:
Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.
doc.xpath("td[3]/div[1]/a").each do |anchor|
puts anchor['href']
puts anchor['src']
...
end

Resources