I have some HTML like this (which I cannot change):
<div>
<p class="name">
<span>Employee Name: </span>
John Smith
</p>
</div>
And I'd like to use xpath to extract out just the "John Smith" part..
I have been trying to use this code:
//div//p[#class='name']//text()
However, it doesn't work.
What is the best way to achieve what I'm after?
Many thanks.
You almost have it.
Change your XPath to: //div//p[#class='name']/text()
When you use //text(), it selects all of the descendant text() nodes, which includes the "Employee Name: " text node that is a child of the <span>.
It is best to avoid the // when possible, as it makes your expressions less efficient and more prone to those sort of issues.
Related
I have an HTML page which contains the following:
<div class="book-info">
The book is <i>Italicized Title</i> by Author McWriter
</div>
When I view this in Chrome Dev Tools, it looks like:
<div class="book-info">
"The book is "
<i>Italicized Title</i>
" by Author McWriter"
</div>
I need a way to find this single div using XPath.
Constraints:
There are many book-info divs on the page, so I can't just look for a div with that class.
Any part of the text within the book-info div might also appear in another, but the complete text within the div is unique. So I want to match the entire text, if possible.
It is not guaranteed that an <i> will exist within the book-info div. The following could also exist, and I need to be able to find it as well (but my code is working for this case):
<div class="book-info">
"Author McWriter's Legacy"
</div>
I think I can detect whether the div I'm looking for contains an <i> or not, and construct a different XPath expression depending on that.
Things I have tried:
//div[text()=concat("The book is ","Italicized Title"," by Author McWriter")]
//div[text()=concat("The book is ","<i>Italicized Title"</i>," by Author McWriter")]
//div[text()=concat("The book is ",[./i[text()="Italicized Title"]," by Author McWriter")]
//div[concat(text()="The book is ", i[text()="Italicized Title"],text()=" by Author McWriter")]
None of these worked for me. What XPath expression would?
You can use this combination of XPath-1.0 predicates in one expression. It matches both cases:
//div[#class="book-info" and ((i and contains(text()[1],"The book is") and contains(text()[2],"by Author McWriter")) or (not(i) and contains(string(.),"Author McWriter's Legacy")))]
<div class="summary-item">
<label >Price</label>
<div class="value">
0.99 GBP
</div>
</div>
<div class="summary-item">
<label >Other info</label>
<div class="value">
All languages
</div>
</div>
I am trying to get the "0.99 GBP" using an XPath expression, so far I have reached the label using this (note there is another class by the name summary-item, therefore I need to uniquely identify with the label name Price)
sel.xpath('//*/div[#class="summary-item"]/label[text()="Price"]').extract()
However, I am unable to get to the class, I tried using following-sibling, but I did not succeed, any help will be appreciated.
The existence of child nodes can be part of the predicate. Put the test for label into a predicate for the parent, either as a separate predicate (adding the target node as well):
//div[#class="summary-item"][label[text()="Price"]]/div[#class="value"]
or joined with and:
//div[#class="summary-item" and label[text()="Price"]]/div[#class="value"]
(Note you don’t need //*/div at the start.)
You could use following-sibling if you wanted, it would look like this:
//div[#class="summary-item"]/label[text()="Price"]/following-sibling::div[#class="value"]
(here the label div isn’t part of the predicate).
One more thing to be aware of, using XPath to select HTML classes doesn’t work the same as using CSS – XPath will only match the exact string whereas CSS matches even if the element is in more than one class. In this case it works out okay but you should watch out for it. Search StackOverflow if it will be an issue, there are a few answers descibing it.
i need to scrap information form a website contain the property details.
<div class="inner">
<div class="col">
<h2>House in Digana </h2>
<div class="meta">
<div class="date"></div>
<span class="category">Houses</span>,
<span class="location">Kandy</span>
</div>
</div>
<div class="attr polar">
<span class="data">Rs. 3,600,000</span>
</div>
what is the xpath notation for "Kandy" and "Rs. 3,600,000" ?
It is not wise to address text nodes directly using text() because of nuances in an XML document.
Rather, addressing an element node directly returns the concatenation of all descendant text nodes as the element value, which is what people usually want (and think they are getting when they address text nodes).
The canonical example I use in the classroom is this example of OCR'ed content as XML:
<cost>39<!--that 9 may be an 8-->.22</cost>
The value of the element using the XPath address cost is "39.22", but in XSLT 1.0 the value of the XPath address cost/text() is "39" which is not complete. In XSLT 2.0 (which is how the question is tagged), you get two text nodes "39" and ".22", which if you concatenate them it looks correct. But, if you pass them to a function requiring a singleton argument, you will get a run-time error. When you address an element, the text returned is concatenated into a single string, which is suitable for a singleton argument.
I tell students that in all of my professional work there are only very (very!) few times that I ever have to use text() in my stylesheets.
So //span[#class='location' or #class='data'] would find the two fields if those were the only such elements in the entire document. You may need to use ".//span" from a location inside of the document tree.
hey guys coudln't get around this. I have an html structured as follow:
<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an awesome review
</div>
what i am trying to get is just the text "this is an awesome review" but everytyme i query the node i also get the other content in the childs. using something like this now ".//div[#class='review-text']" how to get just that text only? tank you very much
You're almost there! Just add /text() at the end of your XPath to get the text node.
An XPath expression such as //div returns a set of nodes, in this case div elements. These are in effect pointers to the original nodes in the original tree; the nodes are still connected to their parents, children, ancestors, and siblings. If you see the children of the div element and don't want them, that's not the fault of the XPath processor, it's the fault of whatever software is processing the results returned by the XPath expression.
You can get the text that's an immediate child of the div element by using /text() as suggested. However, that assumes that you know exactly what you are expecting to find in the HTML page - if "awesome" were in italics, it would give you something different.
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Assuming the above, how is it possible to use Xpath to grab each fruit ? Must use xpath of some sort.
should i use substring-after(following-sibling...) ?
EDIT: I am using Nokogiri parser.
Well, you could use "//br/text()", but that will return all the text-nodes inside the <br> tags. But since the above isn't well-formed xml I'm not sure how you are going to use xpath on it. Regex is usually a poor choice for html, but there are html (not xhtml) parsers available. I hesitate to suggest one for ruby simply because that isn't "my area" and I'd just be googling...
Try the following, which gets all text siblings of <br> tags as array of strings stripped from trailing and leading whitespaces:
require 'rubygems'
reguire 'nokogiri'
doc = Nokogiri::HTML(DATA)
fruits =
doc.xpath('//br/following-sibling::text()
| //br/preceding-sibling::text()').map do |fruit| fruit.to_s.strip end
puts fruits
__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Is this what you want?
There are several issues here:
XPath works on XML - you have HTML which is not XML (basically, the tags don't match so an XML parser will throw an exception when you give it that text)
XPath normally also works by finding the attributes inside tags. Seeing as your <br> tags don't actually contain the text, they're just in-between it, this will also prove difficult
Because of this, what you probably want to do is use XPath (or similar) to get the contents of the div, and then split the string based on <br> occurrences.
As you've tagged this question with ruby, I'd suggest looking into hpricot, as it's a really nice and fast HTML (and XML) parsing library, which should be much more useful than mucking around with XPath