XPath: How do I find a page element which contains another element, using the full text of both?

I have an HTML page which contains the following:
<div class="book-info">
The book is <i>Italicized Title</i> by Author McWriter
</div>
When I view this in Chrome Dev Tools, it looks like:
<div class="book-info">
"The book is "
<i>Italicized Title</i>
" by Author McWriter"
</div>
I need a way to find this single div using XPath.
Constraints:
There are many book-info divs on the page, so I can't just look for a div with that class.
Any part of the text within the book-info div might also appear in another, but the complete text within the div is unique. So I want to match the entire text, if possible.
It is not guaranteed that an <i> will exist within the book-info div. The following could also exist, and I need to be able to find it as well (but my code is working for this case):
<div class="book-info">
"Author McWriter's Legacy"
</div>
I think I can detect whether the div I'm looking for contains an <i> or not, and construct a different XPath expression depending on that.
Things I have tried:
//div[text()=concat("The book is ","Italicized Title"," by Author McWriter")]
//div[text()=concat("The book is ","<i>Italicized Title"</i>," by Author McWriter")]
//div[text()=concat("The book is ",[./i[text()="Italicized Title"]," by Author McWriter")]
//div[concat(text()="The book is ", i[text()="Italicized Title"],text()=" by Author McWriter")]
None of these worked for me. What XPath expression would?

You can use this combination of XPath-1.0 predicates in one expression. It matches both cases:
//div[#class="book-info" and ((i and contains(text()[1],"The book is") and contains(text()[2],"by Author McWriter")) or (not(i) and contains(string(.),"Author McWriter&apos;s Legacy")))]

Related

I have doubts about two mappings on the site-prism

I don't want to use xpath on the elements below.
element :img_login, :xpath, '//*[@id="main-wrapper"]/div/section/div/div[2]/div/div/div[1]/img'
element :msg_login_senha_invalidos, :xpath, '//*[@id="main-wrapper"]/div/section/div/div[2]/div/div/div[2]/div/p'
They are on the page as follows:
element img_login
<div class="sc-jRQAMF eRnhep">
<img src="https://quasar-flash-staging.herokuapp.com/assets/login/flashLogo-3a77796fc2a3316fe0945c6faf248b57a2545077fac44301de3ec3d8c30eba3f.png" alt="Quasar Flash">
</div>
element msg_login_senha_invalidos
<p class="MuiFormHelperText-root MuiFormHelperText-contained Mui-error MuiFormHelperText-filled">Login e/ou senha inválidos</p>
You have asked multiple questions about converting from XPath to some other type of selector when using Site-Prism. StackOverflow is meant to be a place to come, learn, and improve your skills, not just to get someone else to do your work, so it really seems you'd be better off reading up on CSS and how it can be used to select elements. Also note that there's nothing specifically wrong with using XPath per se; it's just that the way people new to testing tend to use it (copying a fully specified selector from their browser) leads to selectors that are far too specific and therefore brittle. A good site for learning the general CSS selector options available is https://flukeout.github.io/ - and you can see the built-in selector types provided by Capybara at https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selector.rb#L18
In your current case the selectors below may work, but with only the HTML you have provided, all that can be said is that they will match the elements shown; they may also match other elements, which would give you ambiguous element errors.
element :img_login, :css, 'img[alt="Quasar Flash"]' # CSS attribute selector
element :msg_login_senha_invalidos, :css, 'p.Mui-error', text: 'Login e/ou senha inválidos' # CSS class selector combined with Capybara text filter

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

SO contributors, I am fully aware of the question How to obtain href values from a div using xpath?, which basically deals with one part of my problem, yet for some reason the solution posted there does not work in my case, so I would kindly ask for help in resolving two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with the content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (2nd part) how can I extract the href value of only the "here" hyperlink?
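In case it helps to cross-check the expressions outside xidel, here is a minimal sketch assuming Python with lxml and a hypothetical local copy of the page saved as page.html:
from lxml import html

# Hypothetical local copy of the page; substitute the real content as needed.
page = html.fromstring(open("page.html", encoding="utf-8").read())

# All hrefs of links under the content div
all_hrefs = page.xpath('//div[@class="content"]//a/@href')

# Only the link whose visible text contains "here" (assuming that word sits inside the <a>)
here_href = page.xpath('//div[@class="content"]//a[contains(., "here")]/@href')

print(all_hrefs, here_href)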

How to find xpath expression to select this text

I have this HTML code and have tried many times to get the pure XPath for the "sample text" and then the "Author" text as separate XPath expressions, but I can't find any way to do it!
<div class="Text">
“sample article here with quotation marks .”
<br/>
―
Author
so please help, it's driving me mad!!
thanks
You can get the first part by selecting the div by its class, finding the br inside it, and retrieving the text of its preceding sibling:
//div[#class="Text"]/br/preceding-sibling::text()
The second part is easier: just get the text of the a element inside the div:
//div[#class="Text"]/a/text()

XPath expression -hierarchy

<div class="summary-item">
<label >Price</label>
<div class="value">
0.99 GBP
</div>
</div>
<div class="summary-item">
<label >Other info</label>
<div class="value">
All languages
</div>
</div>
I am trying to get the "0.99 GBP" using an XPath expression. So far I have reached the label using this (note there is another div with the class summary-item, so I need to uniquely identify the right one via the label text Price):
sel.xpath('//*/div[@class="summary-item"]/label[text()="Price"]').extract()
However, I am unable to get to the value div. I tried using following-sibling but did not succeed; any help would be appreciated.
The existence of child nodes can be part of the predicate. Put the test for label into a predicate for the parent, either as a separate predicate (adding the target node as well):
//div[#class="summary-item"][label[text()="Price"]]/div[#class="value"]
or joined with and:
//div[#class="summary-item" and label[text()="Price"]]/div[#class="value"]
(Note you don’t need //*/div at the start.)
You could use following-sibling if you wanted, it would look like this:
//div[#class="summary-item"]/label[text()="Price"]/following-sibling::div[#class="value"]
(here the label element isn’t part of the predicate).
One more thing to be aware of: selecting HTML classes with XPath doesn’t work the same way as with CSS. XPath will only match the exact attribute string, whereas CSS matches even if the element has more than one class. In this case it works out okay, but you should watch out for it; if it becomes an issue, search StackOverflow, as there are a few answers describing it.
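For instance, a quick sketch using Scrapy's selector API (an assumption, since the question's sel object isn't shown in full) with the first form above:
from scrapy.selector import Selector

# Compact stand-in for the HTML shown in the question
snippet = """
<div class="summary-item"><label>Price</label><div class="value">0.99 GBP</div></div>
<div class="summary-item"><label>Other info</label><div class="value">All languages</div></div>
"""

sel = Selector(text=snippet)
price = sel.xpath('//div[@class="summary-item"][label[text()="Price"]]'
                  '/div[@class="value"]/text()').get()
print(price)  # '0.99 GBP' (strip() may be needed when the value div contains extra whitespace)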

XPath/Scrapy crawling weirdly formatted pages

I've been playing around with scrapy and I see that knowledge of XPath is vital in order to leverage scrapy successfully. I have a webpage I'm trying to gather some information from, where the tags are formatted as such:
<div id = "content">
<h1></h1>
<p></p>
<p></p>
<h1></h1>
<p></p>
<p></p>
Now the heading contains a title, the first 'p' contains data1, and the second 'p' contains data2. This seems like a pretty straightforward task, and if this were always the case I would have no problem, i.e. hsx.select('//*[@id="content"]') etc. etc.
The problem is, sometimes there will only be ONE p tag following a header instead of two.
<div id = "content">
<h1></h1>
<p></p> (a)
<h1></h1>
<p></p> (b)
<p></p> (c)
What I would like is, if a paragraph tag is missing, to store that information as just blank data in my list. Right now what happens is the lists are storing the first heading, the first paragraph tag (a), and then the paragraph tag under the second h1 (b).
What it should be doing is storing
title -> h1[0]
data1[0] -> (a)
data2[0] -> []
I hope that makes sense. I've been looking for a good xpath or scrapy solution to do this but I can't seem to find one. Any helpful tips would be awesome. thanks
Use:
//div[@id='content']
/h1[1]/following-sibling::*
[not(position()>2)][self::p]
This selects the (at most) two immediate sibling elements, only if they are p, of the first h1 child of any div (we know that this must be just one div) whose id attribute has the string value "content".
If only the first immediate sibling is a p, then the returned node-list contains only one item.
You can check whether the length of the returned node-list is 1 or 2, and use this to build the control of your processing.
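For example, a sketch of that check (assuming Python with lxml rather than Scrapy's selectors; the per-h1 loop generalises the expression above to every heading):
from lxml import html

# Assumed sample where the second <p> is missing under the first heading
page = html.fromstring("""
<div id="content">
  <h1>Title A</h1><p>data1 A</p>
  <h1>Title B</h1><p>data1 B</p><p>data2 B</p>
</div>
""")

rows = []
for h1 in page.xpath('//div[@id="content"]/h1'):
    # At most the two immediately following siblings, kept only when they are <p> elements
    ps = h1.xpath('following-sibling::*[not(position()>2)][self::p]')
    texts = [p.text_content() for p in ps] + [''] * (2 - len(ps))
    rows.append((h1.text_content(), texts[0], texts[1]))

print(rows)  # [('Title A', 'data1 A', ''), ('Title B', 'data1 B', 'data2 B')]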
I think you'd want something like this; not 100% though / untested.
//h1/following-sibling::*[2][self::p]/text()|//h1[not(following-sibling::*[2][self::p])]/string('')
