I am building a web scraper to get information from a webpage. I want the correct XPath expression to get that information.
<div class="inner">
<div class="col">
<h2>Land in Kadawatha</h2>
<div class="meta">
<div class="date"></div>
<span class="category">Other Lands</span>,
<span class="location">Gampaha</span>
</div>
</div>
How do I access "Land in Kadawatha" using an XPath?
Standalone XPath 1.0, without XSLT:
//div[contains(concat(" ", @class, " "), " inner ")]/div[contains(concat(" ", @class, " "), " col ")]/h2[1]/a
Using this function:
<xsl:function name="markup:has-class" as="xs:boolean">
<xsl:param name="el" as="element()" />
<xsl:param name="class-name" as="item()" />
<xsl:sequence select="$el/@class and tokenize(upper-case(normalize-space($el/@class)), ' ') = upper-case(string($class-name))" />
</xsl:function>
You can do:
*[markup:has-class(., 'inner')]/*[markup:has-class(., 'col')]//h2/string()
Adjust accordingly depending on your context node.
Based on that snippet, the XPath would be
//div[@class='col']/h2/a
and your code would then look like
IWebElement element = driver.FindElement(By.XPath("//div[@class='col']/h2/a"));
string elementText = element.Text;
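For comparison, here is a minimal Python sketch of the same lookup using only the standard library's xml.etree.ElementTree. It assumes the snippet has been made well-formed (a closing </div> is added), and it targets the <h2> directly because the snippet as posted shows no <a> inside it. ElementTree supports only a subset of XPath, so the token-based class test is written as a hypothetical helper, has_class, instead of contains().

```python
import xml.etree.ElementTree as ET

# The question's snippet, with the missing closing </div> added (assumption)
# so that it parses as XML.
HTML = """
<div class="inner">
  <div class="col">
    <h2>Land in Kadawatha</h2>
    <div class="meta">
      <div class="date"></div>
      <span class="category">Other Lands</span>,
      <span class="location">Gampaha</span>
    </div>
  </div>
</div>
"""


def has_class(element, name):
    """Hypothetical helper: token-based class test, similar in spirit to
    contains(concat(' ', @class, ' '), ' name ')."""
    return name in element.get("class", "").split()


root = ET.fromstring(HTML)

# ElementTree's XPath subset only does exact attribute matches.
heading = root.find(".//div[@class='col']/h2")
title = heading.text.strip()
print(title)

# The same lookup using the token-based class test.
cols = [d for d in root.iter("div") if has_class(d, "col")]
title_by_class = cols[0].find("h2").text.strip()
```

The exact match on @class is enough here because each element carries a single class; the helper is only needed when an element has several classes.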
I have been trying to figure this out for a while and can't get my head around it. I have tried using following-sibling but it's not working for me. The classes are really generic across the board. I was trying to use the text within the <strong> tag to identify then pull the sibling content:
<div class="generic-class">
<p class="generic-class2">
<strong>Content title</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title2</strong>
"
Needed Content "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title3</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title4</strong>
"
Dont Need "
<br>
</p>
</div>
I tried the expression below without success. I then realised that the text is actually in the <p> tag, so it's not a sibling:
normalize-space(//*[@class="generic-class"]/p/strong/following-sibling::text())
Would there be a way of me finding the text in the <strong> tag "Content title2" and then getting the text in the parent?
Any help would be amazing, thanks!
This one should return "Needed Content":
normalize-space(//p/strong[.="Content title2"]/following-sibling::text())
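A rough Python equivalent, using only the standard library, might look like the sketch below. It assumes the fragments are wrapped in a hypothetical <root> element, that the <br> tags are self-closed so the markup parses as XML, and that the quote marks in the question are just how the browser inspector displays text nodes. ElementTree has no following-sibling axis; the text that follows an element is exposed as that element's tail, which plays the role of following-sibling::text()[1] here.

```python
import xml.etree.ElementTree as ET

# Two of the question's blocks, wrapped in a hypothetical <root> element.
HTML = """
<root>
  <div class="generic-class">
    <p class="generic-class2"><strong>Content title</strong> Dont Need <br/></p>
  </div>
  <div class="generic-class">
    <p class="generic-class2"><strong>Content title2</strong> Needed Content <br/></p>
  </div>
</root>
"""

root = ET.fromstring(HTML)

# [.='text'] matches on the element's full text (Python 3.7 or later).
strong = root.find(".//p/strong[.='Content title2']")

# .tail is the text node right after </strong>; collapse whitespace the way
# normalize-space() would.
needed = " ".join((strong.tail or "").split())
print(needed)
```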
I have included the XML below.
XML 1:
<PP XML="2000_4_174.xml">
<P name="Antony" value="IN"/>
<P name="sitting" value="17 AUGUST 2000"/>
<P name="type" value="reported"/>
<P name="startpage" value="174"/>
</PP>
XML 2:
<PP XML="2000_4_17411.xml">
<P name="Antony" value="IN"/>
<P name="sitting" value="17 AUGUST 2000"/>
<P name="type" value="reported"/>
<P name="startpage" value="1"/>
</PP>
I am using several conditions in an XPath query to get the @XML value. The conditions are (@name="Antony" and @value="IN") and (@name="startpage" and @value="174"), so the expected output is 2000_4_174.xml.
I have tried this query; please suggest how to add the other two conditions.
let $uri := //PP/P[@name="Antony" and @value="IN"]
for $i in $uri
let $j := xdmp:node-uri($i)
let $s := doc($j)/PP/@XML
return $s
The XPath for that would be:
//PP[P[@name="Antony" and @value="IN"]
and
P[@name="startpage" and @value="174"]
]/@XML
This XPath returns the XML attribute of any <PP> element that contains both of the following child elements:
<P name="Antony" value="IN"/>
<P name="startpage" value="174"/>
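To check the logic outside of an XQuery engine, here is a small Python sketch using only the standard library. It assumes the two documents are available as strings, and the function name matching_xml_attrs is made up for illustration. ElementTree's XPath subset has no "and" operator, so each pair of attribute conditions is written as chained predicates and the two pairs are combined in Python.

```python
import xml.etree.ElementTree as ET

# The two documents from the question.
DOCS = [
    """<PP XML="2000_4_174.xml">
    <P name="Antony" value="IN"/>
    <P name="sitting" value="17 AUGUST 2000"/>
    <P name="type" value="reported"/>
    <P name="startpage" value="174"/>
    </PP>""",
    """<PP XML="2000_4_17411.xml">
    <P name="Antony" value="IN"/>
    <P name="sitting" value="17 AUGUST 2000"/>
    <P name="type" value="reported"/>
    <P name="startpage" value="1"/>
    </PP>""",
]


def matching_xml_attrs(docs):
    """Return the XML attribute of each <PP> that has both required <P> children."""
    hits = []
    for text in docs:
        pp = ET.fromstring(text)
        # Chained predicates: both attribute tests must hold on the same <P>.
        has_antony = pp.find("P[@name='Antony'][@value='IN']") is not None
        has_start = pp.find("P[@name='startpage'][@value='174']") is not None
        if has_antony and has_start:
            hits.append(pp.get("XML"))
    return hits


result = matching_xml_attrs(DOCS)
print(result)
```

Only the first document qualifies, because the second one has startpage set to 1.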
Does anyone know if there is a native method for printing the attributes of a Nokogiri::XML::Node without its inner HTML or text content?
For example, given the following Nokogiri::XML::Node:
<div id="customer" class="highlighted">
<h1>Customer Name</h1>
<p>Some customer description</p>
</div>
I would like to print only:
<div id="customer" class="highlighted">
or
<div id="customer" class="highlighted"/>
or
<div id="customer" class="highlighted"></div>
I know I could simply loop through the list of attributes using the attributes method, but I was wondering if Nokogiri already supports something like this natively.
You could output the node with its content deleted:
doc = Nokogiri::HTML.fragment(
'<div id="customer" class="highlighted">
<h1>Customer Name</h1>
<p>Some customer description</p>
</div>'
)
node = doc.at_css('#customer').clone
node.content = nil
p node.to_html
#=> "<div id=\"customer\" class=\"highlighted\"></div>"
Given this HTML:
<li class="check_boxes input optional" id="activity_roles_input">
<fieldset class="choices">
<legend class="label"><label>Roles</label></legend>
<input id="activity_roles_none" name="activity[role_ids][]" type="hidden" value="" />
<ol class="choices-group">
<li class="choice">
<label for="activity_role_ids_104">
<input id="activity_role_ids_104" name="activity[role_ids][]" type="checkbox" value="104" />Language Therapist
</label>
</li>
<li class="choice">
<label for="activity_role_ids_103">
<input id="activity_role_ids_103" name="activity[role_ids][]" type="checkbox" value="103" />Speech Therapist
</label>
</li>
</ol>
</fieldset>
</li>
I am trying to use Selenium with XPath to select the first checkbox input element, but I am having problems selecting it.
I cannot use the database ID (104), because this is for repeated tests that get new IDs each time. I need to select the first input checkbox based on it having the text "Language Therapist".
I have tried:
xpath=(//li[contains(@id,'activity_roles_input')])//input
and
xpath=(//li[contains(@id,'activity_roles_input')])//contains('Language Therapist")
but it is not finding the element.
When I do:
xpath=(//li[contains(@id,'activity_roles_input')])
it gets to the input set. The problem I am having is selecting the first input checkbox control for 'Language Therapist'.
First, find any <li> containing the text, and then look among its descendants for the first checkbox.
xpath=(//li[contains(., "Language Therapist")]/descendant::input[@type="checkbox"][1])
(From Michael)
The above worked for me. In the end I actually used
xpath=(//li[contains(@id,'activity_roles_input')]/descendant::input[@type="checkbox"][1])
because I liked identifying the element by its CSS ID.
Here is an interesting thing I noticed when running this small XSL against your XML.
XSL:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="//li[@id ='activity_roles_input']">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
Roles
Language Therapist
Speech Therapist
You have
xpath=(//li[contains(@id,'activity_roles_input')])//input
Shouldn't that be
xpath=(//li[contains(@id,'activity_roles_input')]//input)
or rather
xpath=(//li[@id='activity_roles_input']//input)
?
xpath=(//li[@id='activity_roles_input']//input[1])
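One thing worth checking with the expressions above is what [1] applies to. The sketch below uses Python's standard xml.etree.ElementTree, which implements only a subset of XPath, on the HTML from the question (it happens to be well-formed XML). It shows that //input[1] means "every input that is the first input child of its own parent", and then picks the checkbox by its trailing label text. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, unchanged.
HTML = """<li class="check_boxes input optional" id="activity_roles_input">
<fieldset class="choices">
<legend class="label"><label>Roles</label></legend>
<input id="activity_roles_none" name="activity[role_ids][]" type="hidden" value="" />
<ol class="choices-group">
<li class="choice">
<label for="activity_role_ids_104">
<input id="activity_role_ids_104" name="activity[role_ids][]" type="checkbox" value="104" />Language Therapist
</label>
</li>
<li class="choice">
<label for="activity_role_ids_103">
<input id="activity_role_ids_103" name="activity[role_ids][]" type="checkbox" value="103" />Speech Therapist
</label>
</li>
</ol>
</fieldset>
</li>"""

root = ET.fromstring(HTML)

# //input[1]: each <input> that is the first <input> child of its own parent.
# Here that is all three inputs, including the hidden one.
per_parent_first = [e.get("id") for e in root.findall(".//input[1]")]

# The first checkbox in document order; find() stops at the first match.
first_checkbox = root.find(".//input[@type='checkbox']")

# The checkbox whose trailing text (its .tail) is the wanted label.
by_label = next(
    e for e in root.iter("input")
    if e.get("type") == "checkbox"
    and (e.tail or "").strip() == "Language Therapist"
)

print(per_parent_first)
print(first_checkbox.get("value"), by_label.get("id"))
```

This is why the accepted expression restricts the step to input[@type="checkbox"] before applying [1], and wraps the whole thing in parentheses.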
I have a string (@description) that contains HTML code, and I want to extract the content between two elements. It looks something like this:
<b>Content title</b><br/>
*All the content I want to extract*
<a href="javascript:print()">
I've managed to do something like this:
@want = @description.match(/Content title(.*?)javascript:print()/m)[1].strip
But obviously this solution is far from perfect, as I get some unwanted characters in my @want string.
Thanks for your help
Edit:
As requested in the comments, here is the full code:
I'm already parsing an HTML document, and the following code:
@description = @doc.at_css(".entry-content").to_s
puts @description
returns:
<div class="post-body entry-content">
<img alt="Photo title" height="333" src="http://photourl.com" width="500"><br><br><div style="text-align: justify;">
Some text</div>
<b>More text</b><br><b>More text</b><br><br><ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br><b>Content Title</b><br>
Some text<br><br>
Some text(with links and images)<br>
Some text(with links and images)<br>
Some text(with links and images)<br>
<br><br><img src="http://url.com/photo.jpg">
<div style="clear: both;"></div>
</div>
The text can include more paragraphs, links, images, etc. but it always starts with the "Content Title" part and ends with the javascript reference.
This XPath expression selects all (sibling) nodes between the nodes $vStart and $vEnd:
$vStart/following-sibling::node()
[count(.|$vEnd/preceding-sibling::node())
=
count($vEnd/preceding-sibling::node())
]
To obtain the full XPath expression to use in your specific case, simply substitute $vStart with:
/*/b[. = 'Content Title']
and substitute $vEnd with:
/*/a[@href = 'javascript:print()']
The final XPath expression after the substitutions is:
/*/b[. = 'Content Title']/following-sibling::node()
[count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
=
count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
]
Explanation:
This is a simple corollary of the Kayessian formula for the intersection of two nodesets $ns1 and $ns2:
$ns1[count(.|$ns2) = count($ns2)]
In our case, the set of all nodes between the nodes $vStart and $vEnd is the intersection of two node-sets: all following siblings of $vStart and all preceding siblings of $vEnd.
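The same intersection idea can be sketched in Python with the standard library. This is only an illustration under some assumptions: the sample document is a shortened, made-up stand-in for the real page, and ElementTree does not expose text nodes as siblings (text between elements lives in each element's tail), so the sketch collects only the element nodes between the two markers.

```python
import xml.etree.ElementTree as ET

# A shortened, hypothetical stand-in for the real page.
XML = """<div>
<b>Intro</b>
<br/>
<b>Content Title</b>
<br/>
<p>Some text</p>
<br/>
<a href="javascript:print()">Print</a>
<div/>
</div>"""

root = ET.fromstring(XML)
children = list(root)

# Counterparts of $vStart and $vEnd.
start = next(c for c in children if c.tag == "b" and c.text == "Content Title")
end = next(c for c in children if c.tag == "a" and c.get("href") == "javascript:print()")

# The two node-sets being intersected.
following = children[children.index(start) + 1:]   # following siblings of start
preceding = children[:children.index(end)]         # preceding siblings of end

# Intersection by node identity: keep a node from `following` only if it is
# also a member of `preceding`.
between = [n for n in following if any(n is m for m in preceding)]
tags = [n.tag for n in between]
print(tags)
```

The identity test (`is`) mirrors what count(.|$ns2) = count($ns2) checks in XPath 1.0: adding the node to the second set does not make that set any larger.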
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vStart" select="/*/b[. = 'Content Title']"/>
<xsl:variable name="vEnd" select="/*/a[@href = 'javascript:print()']"/>
<xsl:template match="/">
<xsl:copy-of select=
"$vStart/following-sibling::node()
[count(.|$vEnd/preceding-sibling::node())
=
count($vEnd/preceding-sibling::node())
]
"/>
==============
<xsl:copy-of select=
"/*/b[. = 'Content Title']/following-sibling::node()
[count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
=
count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document (converted to a well-formed XML document):
<div class="post-body entry-content">
<a href="http://www.photourl">
<img alt="Photo title" height="333" src="http://photourl.com" width="500"/>
</a>
<br />
<br />
<div style="text-align: justify;">
Some text</div>
<b>More text</b>
<br />
<b>More text</b>
<br />
<br />
<ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br />
<b>Content Title</b>
<br />
Some text
<br />
<br />
Some text(with links and images)
<br />
Some text(with links and images)
<br />
Some text(with links and images)
<br />
<br />
<br />
<a href="javascript:print()">
<img src="http://url.com/photo.jpg"/>
</a>
<div style="clear: both;"></div>
</div>
the two XPath expressions (with and without variable references) are evaluated and the nodes selected in each case, conveniently delimited, are copied to the output:
<br/>
Some text
<br/>
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
<br/>
<br/>
==============
<br/>
Some text
<br/>
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
<br/>
<br/>
To test your HTML, I added tags around your code and pasted it into a file:
xmllint --html --xpath '/html/body/div/text()' /tmp/l.html
Output:
Some text
Some text
Some text
Some text
Now you can use an XPath module in Ruby and reuse the XPath expression.
You will find many examples by searching Stack Overflow.