Answering a question, I noticed strange libxml behavior on the following expression:
//ancestor::*[@id][1]
for a given context node. I am trying to understand what the expression actually means.
Here is a PHP snippet and the result of running it:
$html = <<<HTML
<div id="div1">
<div id="div2">
<p id="p1">Content</p>
</div>
<div id="div3">
<p id="p2">Content</p>
</div>
</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$contextNode = $doc->getElementById('p1');
$list = $xpath->query('//ancestor::*[@id][1]', $contextNode);
printf("NodeList length: %d\n", $list->length);
foreach ($list as $node) {
    printf("item/@id -> %s\n", $node->getAttribute('id'));
}
Result:
NodeList length: 5
item/@id -> div1
item/@id -> div2
item/@id -> p1
item/@id -> div3
item/@id -> p2
//ancestor::*[@id][1] is short for /descendant-or-self::node()/ancestor::*[@id][1]. Because the path is absolute, the context node matters only for determining its root (document) node /. The first step, descendant-or-self::node(), forms a node-set of the document node and all its descendants of every kind (element, text, comment, and processing instruction nodes). The next step then evaluates ancestor::*[@id][1] for each of those nodes, that is, of all ancestor elements having an id attribute, the nearest one. The result is the union of those nearest ancestors, which is why five nodes come back instead of one. To get the nearest such ancestor of the context node itself, drop the leading //.
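To make that evaluation order concrete, here is a rough emulation in Python. It is only a sketch: the standard library's ElementTree does not implement the ancestor axis, so the walk is done by hand, and the helper name nearest_id_ancestors is made up for illustration. It visits every node in the document (text included), records the nearest ancestor that has an id, and reports the union in document order. Note that the context node never enters the computation.

```python
import xml.etree.ElementTree as ET

MARKUP = """<div id="div1">
<div id="div2">
<p id="p1">Content</p>
</div>
<div id="div3">
<p id="p2">Content</p>
</div>
</div>"""


def nearest_id_ancestors(root):
    """Emulate /descendant-or-self::node()/ancestor::*[@id][1] by hand."""
    selected = set()

    def visit(elem, nearest_id):
        # nearest_id is the id of the closest ancestor of elem that has one.
        if nearest_id is not None:
            selected.add(nearest_id)
        own_id = elem.get("id")
        inner = own_id if own_id is not None else nearest_id
        # Text directly inside elem is a child node of elem as well.
        if elem.text is not None and inner is not None:
            selected.add(inner)
        for child in elem:
            visit(child, inner)
            # Text following a child is still a child node of elem.
            if child.tail is not None and inner is not None:
                selected.add(inner)

    visit(root, None)
    # Report the union in document order, as an XPath node-set would be.
    return [e.get("id") for e in root.iter() if e.get("id") in selected]


result = nearest_id_ancestors(ET.fromstring(MARKUP))
print(result)  # ['div1', 'div2', 'p1', 'div3', 'p2']
```

Every element with an id that has at least one child node ends up in the result, which matches the five items in the PHP output above.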
I am trying to create individual files from the nodes of an XML file. No matter which way I try it, I seem to get stuck in a nested loop: either I keep rewriting each file until they all contain the same node data, or I run through all of the nodes on every loop iteration. I'm sure this should be pretty easy, but I'm getting hung up somewhere.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  item.xpath("//div[@class='meeting-date']/span/@content").each do |date|
    date = date.to_s
    split_date = date.split('T00')
    split_date = split_date[0].gsub("-","_")
    split_date = split_date + ".pcf"
    File.open(split_date,'w'){ |f| f.write(item)}
  end
end
Here is another attempt; I don't understand why it fails to create all the pages. It creates only one page, although if I use puts the count does iterate through all 101 nodes.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
  split_date = date.split('T00')
  split_date = split_date[0].gsub("-","_")
  split_date = split_date + ".pcf"
  File.open(split_date,'w'){ |f| f.write(item)}
end
For further clarification, this is an example of the nodes that I'm trying to create into pages.
<?xml version="1.0" encoding="UTF-8" ?>
<nodes>
<node>
<no-name><div class="meeting-title">Meeting-a</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="a-2021-11-29T00:00:00-06:00">Monday, November 29, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-b</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="e-2021-09-10T00:00:00-05:00">Friday, September 10, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-c</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="f-2021-08-13T00:00:00-05:00">Friday, August 13, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
</nodes>
date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
A path that starts with // is evaluated from the document root, no matter which node you call it on, so every iteration matches the same nodes across the whole document. Dropping the leading slashes makes the path relative to the node you are iterating over:
date = item.xpath("no-name/div[@class='meeting-date']/span/@content").to_s
When you open a file with the w option, it always overwrites the file. To create the file or append to it, use the a option instead. So you can try this:
File.open(split_date,'a'){ |f| f << item }
PS. Make sure that split_date, the name of the file, is unique for each node, since you want a separate file per node.
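For comparison, here is the same per-node lookup sketched in Python with the standard library's ElementTree instead of Nokogiri. The XML is trimmed down to two of the sample nodes, and the names XML and filenames are only for illustration. The path passed to find() is relative, so each lookup stays inside the current node, and the derived file names come out distinct.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample document from the question.
XML = """<nodes>
  <node>
    <no-name>
      <div class="meeting-title">Meeting-a</div>
      <div class="meeting-date"><span content="a-2021-11-29T00:00:00-06:00">Monday, November 29, 2021</span></div>
    </no-name>
  </node>
  <node>
    <no-name>
      <div class="meeting-title">Meeting-b</div>
      <div class="meeting-date"><span content="e-2021-09-10T00:00:00-05:00">Friday, September 10, 2021</span></div>
    </no-name>
  </node>
</nodes>"""

root = ET.fromstring(XML)
filenames = []
for node in root.findall("node"):
    # Relative path: only this <node> is searched, not the whole document.
    span = node.find("no-name/div[@class='meeting-date']/span")
    if span is None:
        continue
    # Same name derivation as the Ruby code: cut at 'T00', swap '-' for '_'.
    stem = span.get("content").split("T00")[0].replace("-", "_")
    filenames.append(stem + ".pcf")

print(filenames)  # ['a_2021_11_29.pcf', 'e_2021_09_10.pcf']
```

The sketch only computes the names and does not write any files. Once each node yields its own name, writing the node to that file is a separate step.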
Any idea how I would get the text between two tags using XPath? Specifically the 3, bd, 1, ba.
<p class="MuiTypography-root RoofCard__RoofCardNameStyled-niegej-8 hukPZu MuiTypography-body1" xpath="1">
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md">$65,000</span></p>
**"3" == $0
" bd, " == $0
"1" == $0
" ba | " == $0**
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md" xpath="1">926</span>
tried:
In fact from your sample that's a simple text() node after p:
//p/following-sibling::text()[1]
but of course you'll need to parse it. This returns almost what you need:
values = response.xpath('//p/following-sibling::text()[1]').re(r'"([^"]+)"')
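If the quotation marks in the sample are only how the browser's inspector displays text nodes, the extracted text itself would contain no quotes. Under that assumption, and assuming the joined text looks like "3 bd, 1 ba | ", the parsing step could be sketched with Python's re module. The variable names and the sample string are illustrative and not taken from the real page.

```python
import re

# Assumed shape of the joined text nodes; not taken from the real page.
raw = "3 bd, 1 ba | "

# Capture the number in front of "bd" and the number in front of "ba".
match = re.search(r"(\d+)\s*bd,\s*(\d+)\s*ba", raw)
if match is None:
    raise ValueError("unexpected text: %r" % raw)

beds, baths = (int(group) for group in match.groups())
print(beds, baths)  # 3 1
```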
I rarely use xpath(), but when I do I keep tripping myself up when interpreting the content of a Nokogiri::XML::NodeSet, and I believe I now know where I have always gone wrong.
Simply put, when I do a 'puts NodeSet' I have always assumed that I could search the NodeSet based on the returned XML. But the first tag returned does not appear to actually be part of the node XML.
'puts n1' returns XML that has a SPAN as its first element, but if I then search with n1.xpath('SPAN') or n1.xpath('SPAN/DIV'), no nodes are found. n1.xpath('DIV') returns the output I expect, which shows there is no SPAN tag in the XML being searched.
The only way I can logically explain this to myself is to assume that the first XML tag of a 'puts node' is the "node name" and not part of the node XML. This works for me going forward, but am I missing something that is going to bite me elsewhere?
CODE:
docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep this works for me.
'nodeset = xpath('tag1/tag2')' returns a node set whose member is the 'tag2' node.
'puts nodeset' displays that 'tag2' member, including its own tag.
'nodeset.xpath('*')' returns the children of 'tag2'.
'nodeset.xpath('tag2')' finds nothing, because 'tag2' is not a child of itself.
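The same distinction between a matched node and its children shows up in other libraries. Here is a small sketch in Python using the standard library's ElementTree rather than Nokogiri, with a shortened copy of the markup above. The variable names are illustrative. A relative path is always evaluated against the children of the node you call it on, never against the node itself.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question.
outer_div = ET.fromstring(
    "<DIV><SPAN>"
    "<DIV id='1'><H1>-H1-</H1></DIV>"
    "<DIV id='2'><H2>-H2-</H2></DIV>"
    "<DIV id='3'><H3>-H3-</H3></DIV>"
    "</SPAN></DIV>"
)

# The match is the SPAN element itself, so printing it shows the SPAN tag.
span = outer_div.findall("SPAN")[0]

# Searching from SPAN looks at SPAN's children, and none of them is a SPAN.
nested_spans = span.findall("SPAN")

# The DIV elements are SPAN's children, so this finds all three.
child_ids = [div.get("id") for div in span.findall("DIV")]

print(span.tag)      # SPAN
print(nested_spans)  # []
print(child_ids)     # ['1', '2', '3']
```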
In this picture of an HTML tree, I only want the <div class="d"> node, but I want to exclude the <table> node and everything below it from the <div class="d"> node.
Well, you can either pick them manually, one by one, with something like this:
tablePath = "//div[@class='d']/table"
table = response.selector.xpath(tablePath).extract()
para_1_Path = "//div[@class='d']/p[5]"
para_1 = response.selector.xpath(para_1_Path).extract()
and so on.
Or you can extract all of the div class="d" data and trim it, but that would be tricky since you say you're new to Scrapy.
Try using Xpath count:
count(preceding-sibling::table)>0
something like:
>>> import lxml.html
>>> s = '''
... <div class="d">
... <p style="text-align: center">...</p>
... <p>...</p>
... <h2>Daydream...</h2>
... <p>...</p>
... <p>...</p>
... <p>VRsat</p>
... <table><tbody><tr><td>...</td></tr></tbody></table>
... <p style="text-align: center">...</p>
... <p style="text-align: center">...</p>
... <div id="click_div">...</div>
... </div>
... '''
>>> doc = lxml.html.fromstring(s)
>>> xpath = '//div[@class="d"]/*[self::table or count(preceding-sibling::table)>0]'
>>> for x in doc.xpath(xpath): x.tag
...
'table'
'p'
'p'
'div'
UPDATE:
The OP is actually asking for the inverse of my solution above.
So add not(), switch or to and, and change the count test to =0:
>>> xpath = '//div[@class="d"]/*[not(self::table) and count(preceding-sibling::table)=0]'
>>> for x in doc.xpath(xpath): x.tag
...
'p'
'p'
'h2'
'p'
'p'
'p'
I am using Html Agility Pack and am trying to extract the links and link text from the following HTML. The page is fetched from a remote site and saved locally as a whole; I then try to extract the links and link text from the local copy. The page naturally contains other HTML, such as other links and text, which I have removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above are the most unique content to work from when trying to extract the links and linktext.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And it's not working; I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have managed to get the syntax correct now. The following code selects the links from the examples above:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
This selects the links nicely with url and text, but how do I go about also getting the time stamp?
That is, select the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link. And have that output with each link inside the output loop according to the above? Thanks for any help in regards to this.
Your HTML example is malformed, that's why you get unexpected results.
To find your first and second values you'll have to get the <a> inside your <span class='Subject2'> - the first value is a href attribute value, the second is InnerText of the anchor. To get the third value you'll have to get the following sibling of the <span class='Subject2'> tag and get its InnerText.
Here is how you can do it:
// XElement requires: using System.Xml.Linq;
var nodes = document.DocumentNode.SelectNodes("//span[@class='Subject2']//a");
// SelectNodes returns null when nothing matches, so guard before iterating.
if (nodes != null)
{
    foreach (var node in nodes)
    {
        if (node.Attributes["href"] != null)
        {
            var link = new XElement("link", node.Attributes["href"].Value);
            var description = new XElement("description", node.InnerText);
            // The time span follows the Subject2 span in the parsed tree.
            var timeNode = node.SelectSingleNode(
                "..//following-sibling::span[@class='time']");
            if (timeNode != null)
            {
                var time = new XElement("pubDate", timeNode.InnerText);
                Response.Write(link);
                Response.Write(description);
                Response.Write(time);
            }
        }
    }
}
this outputs something like:
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open</link>
<description>Description 1 text here</description>
<pubDate>2012-01-20 08:35</pubDate>
<link>/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open</link>
<description>Description 2 text here</description>
<pubDate>2012-01-20 09:35</pubDate>
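The output above keeps the timestamp as it appears in the page, while the question asked for an RSS-style pubDate. As a rough sketch of that last conversion step, here it is in Python rather than C#. The helper name to_pub_date is hypothetical, and the fixed +01:00 offset is an assumption taken from the desired output in the question; the page itself does not state a time zone.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumption: the page's timestamps are local times at a fixed +01:00 offset.
ASSUMED_ZONE = timezone(timedelta(hours=1))


def to_pub_date(stamp):
    """Turn 'YYYY-MM-DD HH:MM' into an RFC 2822 style date string."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    return format_datetime(local.replace(tzinfo=ASSUMED_ZONE))


pub_date = to_pub_date("2012-01-20 08:35")
print(pub_date)  # Fri, 20 Jan 2012 08:35:00 +0100
```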