I am trying to create individual files from the nodes of a XML file. My issue is no matter what way I try it I seem to be getting stuck in a nested loop and I either keep rewriting each file until they are just the same node data over and over, or I run all of the nodes per loop instance. I'm sure this should be pretty easy but I'm getting hung up somewhere.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
item.xpath("//div[#class='meeting-date']/span/#content").each do |date|
date = date.to_s
split_date = date.split('T00')
split_date = split_date[0].gsub("-","_")
split_date = split_date + ".pcf"
File.open(split_date,'w'){ |f| f.write(item)}
end
end
This is another attempt that I don't understand why is failing to create all the pages. This only creates one page, but if I use a "puts" the count does iterate through all 101 nodes.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
date = item.xpath("//no-name/div[#class='meeting-date']/span/#content").to_s
split_date = date.split('T00')
split_date = split_date[0].gsub("-","_")
split_date = split_date + ".pcf"
File.open(split_date,'w'){ |f| f.write(item)}
end
For further clarification, this is an example of the nodes that I'm trying to create into pages.
<?xml version="1.0" encoding="UTF-8" ?>
<nodes>
<node>
<no-name><div class="meeting-title">Meeting-a</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="a-2021-11-29T00:00:00-06:00">Monday, November 29, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-b</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="e-2021-09-10T00:00:00-05:00">Friday, September 10, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-c</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="f-2021-08-13T00:00:00-05:00">Friday, August 13, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
</nodes>
date = item.xpath("//no-name/div[#class='meeting-date']/span/#content").to_s
By using // you are breaking out of the scope of the node you are iterating in. Removing the slashes you preserve the scope of the node.
date = item.xpath("no-name/div[#class='meeting-date']/span/#content").to_s
When you use w option it always rewrite onto the file. What you need is to create or append to the file, it's done with the a option. So you can try this:
File.open(split_date,'a'){ |f| f << item }
PS. Be sure that split_date as the name of the file is uniq for each node since you want a separate file per node
Related
I rarely use xpath() but when I do I keep tripping myself up on interpreting content of Nokogiri::Nodesets and believe I now know where I have always gone wrong.
Simply put when I do a 'puts NodeSet' I have always assumed that I could search the Nodeset based on the returned XML. But the first tag returned does not appear to actually part of the node XML.
'puts n1' returns XML that has a SPAN as the first element of the XML, but if I then do an search n1.xpath('SPAN') or n1.xpath('SPAN/DIV') no nodes are found. n1.xpath('DIV') returns the output I expect and proves no SPAN tag in the XML.
The only way I can logically explain this to myself is if assume that the first xml tag of a 'puts node' is the "Node Name" and not part of the node XML. This works for me going forward but am I missing something that is going to bite me elsewhere.
CODE:
docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep this works for me.
'nodeset = xpath(tag1/tag2)' returns a 'nodeset' containing member node 'tag2'
'puts nodeset' displays the 'tag2' node member
'nodeset.xpath('*')' returns the content of 'tag2
'nodeset.xpath('tag2')' invalid as 'tag2' is not part of the content of 'tag2'
In this picture of an html tree from the this picture of an html tree I only want the <div class="d"> node,but the <table> node and below is what I want to exclude from the <div class="d"> node.
well you can either manually pick them one by one by doing something like this
tablePath = "//div[#class='d']/table"
table = response.selector.xpath(tablePath ).extract(),
para_1_Path = "//div[#class='d']/p[5]"
para_1 = response.selector.xpath(para_1_Path).extract()
and so on
OR you can extract all of the div class="d" data and trim it but this would be tricky as you say you're new to scrapy.
Try using Xpath count:
count(preceding-sibling::table)>0
something like:
>>> import lxml.html
>>> s = '''
... <div class="d">
... <p style="text-align: center">...</p>
... <p>...</p>
... <h2>Daydream...</h2>
... <p>...</p>
... <p>...</p>
... <p>VRsat</p>
... <table><tbody><tr><td>...</td></tr></tbody></table>
... <p style="text-align: center">...</p>
... <p style="text-align: center">...</p>
... <div id="click_div">...</div>
... </div>
... '''
>>> doc = lxml.html.fromstring(s)
>>> xpath = '//div[#class="d"]/*[self::table or count(preceding-sibling::table)>0]'
>>> for x in doc.xpath(xpath): x.tag
...
'table'
'p'
'p'
'div'
UPDATE:
The OP is actually asking about the inverse from my solution above.
So, add not, switch to and, change the count to =0:
>>> xpath = '//div[#class="d"]/*[not(self::table) and count(preceding-sibling::table)=0]'
>>> for x in doc.xpath(xpath): x.tag
...
'p'
'p'
'h2'
'p'
'p'
'p'
I have to retrieve the text from the web page and put it on console.
I am not able to get the text from this html below. Can anyone please help me on this.
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row">
</div>
I tried b.div(:class => 'twelve columns').exist? on irb and it says true.
I tried this - b.div(:class => 'twelve columns').text, and it returns me the text on the header not in paragraph.
I tried with - b.div(:class => 'twelve columns').p.text, it returned me error - unable to locate element, using {:tag_name=>"p"}
Simply doing this on example you wrote worked for me:
browser.div(:class => 'twelve columns').p.text
Your best bet would be to check your page css for actually having provided elements structure, as well as that they are nested properly.
I slightly fixed you HTML:
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row"></div>
</div>
Let's do a tiny example:
div = b.div(:class => 'twelve columns')
Enumeration of elements as follows:
div.elements.each do |e|
p e
end
Will do something like that:
<Watir::HTMLElement ... # <h1>Your product</h1>
<Watir::HTMLElement ... # <p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<Watir::HTMLElement ... #<div class="row">
If you want to specify child element P from the DIV do this:
p = div.p
or
p = div.element( :tag_name => 'p' )
And when get text of P:
p.text # >> 21598: DECLINE: Decline - Property Type not acceptable under this contract
Or event do with your single string:
b.div(:class => 'twelve columns').p.text
=> "21598: DECLINE: Decline - Property Type not acceptable under this contract"
My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div i need to extract Phone and Fax in a array with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
div.css("div.line").each do |line|
line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
end
end
It returns nothing and returns time out error.
If I try as
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
will return same Phone and fax for all other divs. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan /\d{3}-\d{3}-\d{4}/
Given the following HTML code snippet; after finding the link by ID, how would you select the checkbox in the same paragraph?
For example if I wanted to select the checkbox associated with the link with ID="inst_17901-1746-1747".
The order of the paragraphs in the DIV is not consistent between sessions so I cannot select it by index or ID of the checkbox.
<div id="inst-results">
<p>
<input id="inst-results0-check" type="checkbox">
<a class="ws-rendered" id="inst_17901-1746-1747" title="!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
<p>
<input id="inst-results1-check" type="checkbox"><a class="ws-rendered" id="inst_17882-1746-1747" title="!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
</div>
I figured out this solution working off the text of the link, but Zeljko solution is much better.
$browser.div(:id,"inst-results").ps.each { |para|
if para.link.text == "!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range" then
para.checkbox.set
break
end
}
If there is only one checkbox in the paragraph with the link:
browser.link(:id => "inst_17901-1746-1747").parent.checkbox.set
Works with watir-webdriver, not sure if it would work with other Watir gems.