Output XML nodes into individual files - Ruby

I am trying to create individual files from the nodes of an XML file. No matter which way I try it, I seem to get stuck in a nested loop: either I keep rewriting each file until they all contain the same node data, or I run through all of the nodes on every loop iteration. I'm sure this should be pretty easy, but I'm getting hung up somewhere.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  item.xpath("//div[@class='meeting-date']/span/@content").each do |date|
    date = date.to_s
    split_date = date.split('T00')
    split_date = split_date[0].gsub("-","_")
    split_date = split_date + ".pcf"
    File.open(split_date,'w'){ |f| f.write(item)}
  end
end
Here is another attempt; I don't understand why it fails to create all the pages. It creates only one page, although a "puts" shows that the loop does iterate through all 101 nodes.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
  split_date = date.split('T00')
  split_date = split_date[0].gsub("-","_")
  split_date = split_date + ".pcf"
  File.open(split_date,'w'){ |f| f.write(item)}
end
For further clarification, this is an example of the nodes that I'm trying to turn into pages.
<?xml version="1.0" encoding="UTF-8" ?>
<nodes>
<node>
<no-name><div class="meeting-title">Meeting-a</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="a-2021-11-29T00:00:00-06:00">Monday, November 29, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-b</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="e-2021-09-10T00:00:00-05:00">Friday, September 10, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-c</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="f-2021-08-13T00:00:00-05:00">Friday, August 13, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
</nodes>

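date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
A leading // makes the XPath search start from the document root, so you break out of the scope of the node you are iterating over and match the dates of every node on each pass. Without the leading slashes the expression is relative to the current node, which preserves its scope.
date = item.xpath("no-name/div[@class='meeting-date']/span/@content").to_s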
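
To see why the second attempt in the question ends up with a single file, it may help to look at the string handling on its own. The sketch below assumes (the question does not show this) that calling to_s on a result holding several attributes concatenates their values. The three content values are copied from the sample XML; filename_for, unscoped_filenames and scoped_filenames are hypothetical helpers that mirror the split/gsub steps from the question.

```ruby
# Values of the content attribute, copied from the three sample nodes.
CONTENTS = [
  "a-2021-11-29T00:00:00-06:00",
  "e-2021-09-10T00:00:00-05:00",
  "f-2021-08-13T00:00:00-05:00"
].freeze

# Hypothetical helper mirroring the question's split/gsub steps.
def filename_for(date)
  date.split("T00")[0].gsub("-", "_") + ".pcf"
end

# With "//...", every iteration sees all values at once
# (assumed here to arrive as one concatenated string).
def unscoped_filenames(contents)
  contents.map { filename_for(contents.join) }
end

# With the relative path, each iteration sees only its own value.
def scoped_filenames(contents)
  contents.map { |value| filename_for(value) }
end

puts unscoped_filenames(CONTENTS).uniq.inspect  # a single name, rewritten on every pass
puts scoped_filenames(CONTENTS).inspect         # one name per node
```

Under that assumption, everything after the first T00 is discarded, so the unscoped version always produces the filename of the first node in the document.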

The w option always truncates the file and writes it from scratch. If you want to create the file or append to it, use the a option instead. So you can try this:
File.open(split_date,'a'){ |f| f << item }
PS. Make sure that split_date, as the name of the file, is unique for each node, since you want a separate file per node.
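
The difference between the two modes can be checked in isolation. This is a small sketch that writes to a temporary directory; write_twice is a hypothetical helper and the filename is only an example. Note that appending keeps both writes in the same file, so on its own it does not give one file per node; that still depends on the filename being unique.

```ruby
require "tmpdir"

# Hypothetical helper: open the same file twice in the given mode,
# write a different string each time, and return the final contents.
def write_twice(mode)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "a_2021_11_29.pcf")
    File.open(path, mode) { |f| f << "first node" }
    File.open(path, mode) { |f| f << "second node" }
    File.read(path)
  end
end

puts write_twice("w")  # only the second write survives
puts write_twice("a")  # both writes are kept, one after the other
```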

Related

Clarification of Nokogiri::NodeSet XML Content based on 'puts node' and 'puts node.inspect'

I rarely use xpath(), but when I do I keep tripping myself up when interpreting the content of Nokogiri::NodeSets, and I believe I now know where I have always gone wrong.
Simply put, when I do a 'puts NodeSet' I have always assumed that I could search the NodeSet based on the returned XML. But the first tag returned does not appear to actually be part of the node XML.
'puts n1' returns XML that has a SPAN as its first element, but if I then do a search with n1.xpath('SPAN') or n1.xpath('SPAN/DIV'), no nodes are found. n1.xpath('DIV') returns the output I expect, which shows that there is no SPAN tag in the XML being searched.
The only way I can logically explain this to myself is to assume that the first XML tag of a 'puts node' is the "node name" and not part of the node XML. This works for me going forward, but am I missing something that is going to bite me elsewhere?
CODE:
docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep this works for me.
'nodeset = xpath(tag1/tag2)' returns a 'nodeset' containing the member node 'tag2'
'puts nodeset' displays the 'tag2' member node
'nodeset.xpath('*')' returns the content of 'tag2'
'nodeset.xpath('tag2')' finds nothing, as 'tag2' is not part of the content of 'tag2'

XPath: how do I exclude the nodes inside the node I want?

In this picture of an HTML tree, I only want the <div class="d"> node, but I want to exclude the <table> node and everything below it from the <div class="d"> node.
Well, you can either pick them manually one by one, by doing something like this:
tablePath = "//div[@class='d']/table"
table = response.selector.xpath(tablePath).extract()
para_1_Path = "//div[@class='d']/p[5]"
para_1 = response.selector.xpath(para_1_Path).extract()
and so on.
Or you can extract all of the div class="d" data and trim it, but this would be tricky since you say you're new to Scrapy.
Try using Xpath count:
count(preceding-sibling::table)>0
something like:
>>> import lxml.html
>>> s = '''
... <div class="d">
... <p style="text-align: center">...</p>
... <p>...</p>
... <h2>Daydream...</h2>
... <p>...</p>
... <p>...</p>
... <p>VRsat</p>
... <table><tbody><tr><td>...</td></tr></tbody></table>
... <p style="text-align: center">...</p>
... <p style="text-align: center">...</p>
... <div id="click_div">...</div>
... </div>
... '''
>>> doc = lxml.html.fromstring(s)
>>> xpath = '//div[@class="d"]/*[self::table or count(preceding-sibling::table)>0]'
>>> for x in doc.xpath(xpath): x.tag
...
'table'
'p'
'p'
'div'
UPDATE:
The OP is actually asking about the inverse from my solution above.
So, add not, switch or to and, and change the count test to =0:
>>> xpath = '//div[@class="d"]/*[not(self::table) and count(preceding-sibling::table)=0]'
>>> for x in doc.xpath(xpath): x.tag
...
'p'
'p'
'h2'
'p'
'p'
'p'

Retrieve text from <p> on landing page using Ruby Watir

I have to retrieve the text from the web page and print it to the console.
I am not able to get the text from the HTML below. Can anyone please help me with this?
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row">
</div>
I tried b.div(:class => 'twelve columns').exist? in irb and it returns true.
I tried b.div(:class => 'twelve columns').text, and it returns the text of the header, not of the paragraph.
I tried b.div(:class => 'twelve columns').p.text, and it returned an error: unable to locate element, using {:tag_name=>"p"}
Simply doing this on the example you posted worked for me:
browser.div(:class => 'twelve columns').p.text
Your best bet is to check that the page really has the element structure you posted, and that the elements are nested properly.
I slightly fixed your HTML:
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row"></div>
</div>
Let's do a tiny example:
div = b.div(:class => 'twelve columns')
Enumerating its elements as follows:
div.elements.each do |e|
p e
end
will print something like this:
<Watir::HTMLElement ... # <h1>Your product</h1>
<Watir::HTMLElement ... # <p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<Watir::HTMLElement ... #<div class="row">
If you want to select the child element P of the DIV, do this:
p = div.p
or
p = div.element( :tag_name => 'p' )
And then get the text of P:
p.text # >> 21598: DECLINE: Decline - Property Type not acceptable under this contract
Or even do it in a single line:
b.div(:class => 'twelve columns').p.text
=> "21598: DECLINE: Decline - Property Type not acceptable under this contract"

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div I need to extract the phone and fax numbers into an array, along with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
  div.css("div.line").each do |line|
    line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
  end
end
It returns nothing and then fails with a timeout error.
If I try
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
it returns the same phone and fax for every div. Please suggest any other solution that would help.
The easy way:
phone, fax = line.text.scan(/\d{3}-\d{3}-\d{4}/)
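
As a quick check of the regular-expression approach, the sketch below runs on the text of the sample div rather than on a Nokogiri node, so it needs no parsing library; LINE_TEXT is only an approximation of what line.text might return. The labelled variant is an assumption added for illustration, for pages where a div might lack one of the two numbers, and phone_and_fax is a hypothetical helper name.

```ruby
# Approximate text of one "line" div from the question.
LINE_TEXT = <<~TEXT
  Header
  Mailing Address
  2349 Glorem ipsun lorem ipsum CA 95833
  Phone: 111-111-2111 Fax: 111-511-1111
  some text
  some address
  Contact(s)
TEXT

# Positional: relies on the phone number appearing before the fax number.
phone, fax = LINE_TEXT.scan(/\d{3}-\d{3}-\d{4}/)

# Labelled (hypothetical helper): looks each number up by its label,
# so a missing number comes back as nil instead of shifting the other.
def phone_and_fax(text)
  [
    text[/Phone:\s*(\d{3}-\d{3}-\d{4})/, 1],
    text[/Fax:\s*(\d{3}-\d{3}-\d{4})/, 1]
  ]
end

puts phone, fax
p phone_and_fax("Fax: 111-511-1111")
```

The positional form is shorter; the labelled form costs two lookups but does not mistake a lone fax number for the phone number.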

Locating element in same paragraph of another element in watir-webdriver

Given the following HTML snippet, after finding the link by ID, how would you select the checkbox in the same paragraph?
For example, I want to select the checkbox associated with the link with ID="inst_17901-1746-1747".
The order of the paragraphs in the DIV is not consistent between sessions so I cannot select it by index or ID of the checkbox.
<div id="inst-results">
<p>
<input id="inst-results0-check" type="checkbox">
<a class="ws-rendered" id="inst_17901-1746-1747" title="!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
<p>
<input id="inst-results1-check" type="checkbox"><a class="ws-rendered" id="inst_17882-1746-1747" title="!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
</div>
I figured out this solution working off the text of the link, but Zeljko's solution is much better.
$browser.div(:id,"inst-results").ps.each { |para|
  if para.link.text == "!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range" then
    para.checkbox.set
    break
  end
}
If there is only one checkbox in the paragraph with the link:
browser.link(:id => "inst_17901-1746-1747").parent.checkbox.set
Works with watir-webdriver, not sure if it would work with other Watir gems.

Resources