XPath How to Get Posts Separately - xpath

How can I get the texts with XPath separately?
The code I tried returns only one result with all the info combined instead of separate posts:
Post xpath: div
Title xpath: ./p/strong/child::node()
Desc xpath: ./ul/child::node()
Desired:
Title1
Desc1
Title2
Desc2
Got:
Title1 Title2
Desc1 Desc2
HTML:
<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>

Not really clear what your "Desired" example is representing with pairs labeled 1 and 2, but if you are just trying to select each title text followed by its immediate following ul/li text you can use an expression such as:
//div/p/(
./normalize-space(string()),
./(following-sibling::ul[1])/normalize-space(string()))
For each p, it selects the entire text content of the p as a string, then selects the immediately following ul sibling of that p and returns its entire string content. This can easily be refined to select only the p/strong content (instead of all of the p), and similarly for ul/li.
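Note that the expression above uses a parenthesized sequence as a path step, which is XPath 2.0 (or later) syntax, so it will not run in an XPath 1.0-only engine. As a rough illustration of the same pairing idea, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. The helper names clean and pair_posts are hypothetical, and the loop stands in for following-sibling::ul[1] rather than evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, parsed here as well-formed XML.
HTML = """<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>"""


def clean(element):
    """Whitespace-normalized text of an element and all its descendants."""
    return " ".join("".join(element.itertext()).split())


def pair_posts(markup):
    """Pair each <p> with the first <ul> sibling that follows it."""
    root = ET.fromstring(markup)
    children = list(root)
    posts = []
    for index, node in enumerate(children):
        if node.tag != "p":
            continue
        desc = ""
        # Plays the role of following-sibling::ul[1].
        for sibling in children[index + 1:]:
            if sibling.tag == "ul":
                desc = clean(sibling)
                break
        posts.append((clean(node), desc))
    return posts


posts = pair_posts(HTML)
for title, desc in posts:
    print(title)
    print(desc)
```

Each tuple in posts holds one title and its description, so the posts stay separate instead of being merged into one string.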

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
if omitting the condition normalize-space() = "Text" is a clue.
normalize-space() returns ... Text Here ... for the first li
in the posted XML and ... Text ... for the second (or some other
content in place of ... from svg or div/small) causing
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
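To make the difference concrete, here is a minimal Python sketch (standard library only, not the browser's $x) that computes the whitespace-normalized string value of each li and of its div/strong child. The norm helper is hypothetical, and the literal "..." placeholders from the posted markup are kept as plain text.

```python
import xml.etree.ElementTree as ET

# The markup from the question; "..." is kept exactly as posted.
MENU = """<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>"""


def norm(text):
    """Collapse runs of whitespace, roughly like normalize-space()."""
    return " ".join(text.split())


root = ET.fromstring(MENU)
items = [li for li in root.iter("li") if li.get("data-testid") == "menu-item"]

# String value of the whole li: includes the svg and small content too.
whole = [norm("".join(li.itertext())) for li in items]

# String value of the div/strong child only.
strong = [norm(li.findtext("div/strong", default="")) for li in items]

# Equivalent in spirit to: normalize-space(div/strong) = "Text"
matches = [li for li in items if norm(li.findtext("div/strong", default="")) == "Text"]

print(whole)
print(strong)
print(len(matches))
```

Neither whole-li string equals "Text" exactly, which is why the test has to target the strong element.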
data-testid="menu-item" matches both outer li elements, while the text content you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and its inner strong element's text value, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested, both expressions are working correctly

XPath node that doesn't contain a child

I'm trying to access a certain element in this markup using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically, the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you !
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both p elements. The first p element is a sibling of the div (both share the same parent element li), not its parent or ancestor. The following XPath expression would match text nodes from the 2nd p and not those from the first one:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul, find child li elements that don't have a child div element. This matches only the last li in the example HTML
/p/text(): from such li, find child p elements and then return the child text nodes from such p
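The three steps above can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration under the assumption that the snippet parses as well-formed XML; the filtering is done in plain Python rather than by evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

# The markup from the question.
LIST = """<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>"""

root = ET.fromstring(LIST)  # root is the <ul> element itself

texts = []
# Step 1: ul[contains(@id, "content")]
if "content" in root.get("id", ""):
    # Step 2: li[not(div)] -> child li elements without a child div
    for li in root.findall("li"):
        if li.find("div") is None:
            # Step 3: p/text() -> text of the child p elements
            texts.extend(p.text for p in li.findall("p") if p.text)

print(texts)
```

The first li is skipped because it has a div child, even though its p does not contain one.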

scrapy and xpath: get the text in a child element, if the parent element contains text

How do I get the text of a child element if the parent element contains text with a specific string?
For example:
<li>
"string1"
<span>
"Hello"
</span>
</li>
<li>
"string2"
<span>
"Ola"
</span>
</li>
From the above html code, how to get only string "Ola" using xpath?
Without knowing scrapy, I would try
//li[text()[contains(.,"string2")]]/span/text()
//li[text()[contains(.,"string2")]]: selects an li element that has a text node containing string2
/span: selects a span element that is a child of the selected li
/text(): returns the text of the selected span element
Update: This is simpler and should also work:
//li[contains(text(),"string2")]/span/text()
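One difference between the two forms: in XPath 1.0, contains(text(), "string2") only tests the first text node of the li, while text()[contains(., "string2")] tests every direct text node. Below is a minimal sketch of the second form in plain Python with xml.etree.ElementTree (not scrapy's selector API). The wrapping ul and the direct_text_nodes helper are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <ul> root.
FRAGMENT = """<ul>
<li>
"string1"
<span>
"Hello"
</span>
</li>
<li>
"string2"
<span>
"Ola"
</span>
</li>
</ul>"""


def direct_text_nodes(element):
    """Text directly inside the element: before the first child and after each child."""
    nodes = [element.text or ""]
    nodes.extend(child.tail or "" for child in element)
    return nodes


root = ET.fromstring(FRAGMENT)
found = []
for li in root.iter("li"):
    # li[text()[contains(., "string2")]]
    if any("string2" in text for text in direct_text_nodes(li)):
        # span/text(), with surrounding whitespace stripped for readability
        for span in li.findall("span"):
            found.append((span.text or "").strip())

print(found)
```

The quotation marks are part of the text in the posted markup, so they show up in the result as well.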

select parent node containing text inside children's node

Basically, I want to select a node (div) whose child nodes (h1, b, h3) contain the specified text.
<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
I am expecting /html/div/, not /html/div/h1.
I have the expression below, but unfortunately it returns the children instead of the div.
expression = "//div[contains(text(), 'Child text 1')]"
doc.xpath(expression)
So is there a way to do this simply with xpath syntax?
The following expression selects a node (div) in which any descendant element (not just h1, b, h3) contains the specified text (rather than the div itself):
doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')
You can refine that to return only the div with the id contents, as in your example:
doc.xpath('//div[@id="contents" and .//*[contains(text(), "Child text 1")]]')
It does not match, if the text is a text node of the div (directly inside the div), which is my interpretation of the question.
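For illustration, the same "div with a matching descendant" idea can be sketched with Python's standard library. This is an assumption-laden sketch: a closing html tag is added so the sample parses as XML, the "..." line is kept as plain text, and the hypothetical helper divs_with_descendant_text checks only the text before each element's first child, which coincides with the first text node in this sample.

```python
import xml.etree.ElementTree as ET

# The markup from the question, with a closing </html> added so it parses.
PAGE = """<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
</html>"""


def divs_with_descendant_text(root, needle):
    """Return div elements that have a descendant whose leading text contains needle."""
    hits = []
    for div in root.iter("div"):
        # div.iter() yields the div itself first, so it is excluded explicitly.
        descendants = (el for el in div.iter() if el is not div)
        if any(needle in (el.text or "") for el in descendants):
            hits.append(div)
    return hits


root = ET.fromstring(PAGE)
hits = divs_with_descendant_text(root, "Child text 1")
ids = [div.get("id") for div in hits]
print(ids)
```

The result is the div itself, not the h1 that holds the text.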
You could append "/.." to anchor back to the parent. Not sure if there's a more robust method.
expression = "//div[contains(text(), 'Child text 1')]/.."

Xpath: how do you select the second text node (specific text node)

Consider an HTML page:
<html>
apple
orange
drugs
</html>
How can you select orange using XPath?
/html/text()[2]
doesn't work.
You can't do it directly by selecting. You need to call an XPath string function to cut the text() down to the string you want:
substring-after(/html/text()," ") // something like this,
here is a list of string functions
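In the markup as posted, the three words sit in a single text node, which is why text()[2] selects nothing. If the string handling is easier to do in the host language, a minimal Python sketch (standard library only, assuming the page parses as XML) could split that one text node on whitespace:

```python
import xml.etree.ElementTree as ET

# The page from the question: one element containing a single text node.
PAGE = """<html>
apple
orange
drugs
</html>"""

root = ET.fromstring(PAGE)

# root.text is the single text node; split() breaks it on any whitespace.
words = (root.text or "").split()
second = words[1]

print(second)
```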
If the strings are separated with <br> it works
doc = Nokogiri::HTML("""<html>
apple
<br>
orange
<br>
drugs
</html>""")
p doc.xpath('//text()[2]') #=> orange

Resources