Using chained xpath expressions to extract parent node - xpath

I want to extract both key names and values from the following HTML.
<ul>
<li><span class="label">Key A:</span> Value A
</li>
</ul>
<td>
<span class="label">Key B:</span> Value B
</td>
My strategy is to zoom into span.label directly to get the key, then zoom out to extract value from parent text. However, using the following xpath selectors, I am not able to extract the parent text successfully, even though //span[@class="label"]/parent::*/text() produced the right matches in Google Chrome.
for field in section.css('span.label'):
    key = field.xpath('./text()').get().strip()
    value = field.xpath('./parent::*/text()').get().strip()
    section_fields[key] = value
Did I make a mistake with chained expressions?

Try it this way:
import lxml.html as lh
label = """[your html above]"""
doc = lh.fromstring(label)
for l in doc.xpath('//span[@class="label"]'):
    print(l.text.strip(), l.tail.strip())
Output:
Key A: Value A
Key B: Value B

You should change your XPath to:
./parent::*/text()[normalize-space()]
to ignore whitespace-only text nodes. Or you can select the value more directly with:
./following::text()[1]
Complete example:
data = """<ul>
<li><span class="label">Key A:</span> Value A
</li>
</ul>
<td>
<span class="label">Key B:</span> Value B
</td>"""
import lxml.html
tree = lxml.html.fromstring(data)
key=[]
value=[]
for field in tree.xpath('//span'):
    key.append(field.xpath('./text()')[0].strip())
    value.append(field.xpath('./parent::*/text()[normalize-space()]')[0].strip())
table = list(zip(key, value))
for a, b in table:
    print(a, b)
Output :
Key A: Value A
Key B: Value B
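The likely reason the original selector fails only for Key B is that the td starts with a line break, so the first parent text node is whitespace-only and .get() returns it. A minimal sketch of that, using the standard library's xml.etree.ElementTree instead of Scrapy or lxml; the root wrapper element and the helper text_nodes are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <root> so it parses as one XML document.
doc = ET.fromstring(
    '<root><ul>\n'
    '<li><span class="label">Key A:</span> Value A\n'
    '</li>\n'
    '</ul>\n'
    '<td>\n'
    '<span class="label">Key B:</span> Value B\n'
    '</td></root>'
)


def text_nodes(parent):
    """Direct text children of parent in document order, roughly parent/text()."""
    nodes = [parent.text] + [child.tail for child in parent]
    return [t for t in nodes if t]


li_nodes = text_nodes(doc.find('.//li'))   # the value is the first text node
td_nodes = text_nodes(doc.find('.//td'))   # a whitespace-only node comes first

# Keeping only nodes with visible content mirrors text()[normalize-space()].
td_value = [t for t in td_nodes if t.strip()][0].strip()

print(li_nodes)
print(td_nodes)
print(td_value)
```

Taking the first entry of td_nodes and stripping it gives an empty string, which matches the symptom in the question.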

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
if omitting the condition normalize-space() = "Text" is a clue.
normalize-space() returns ... Text Here ... for the first li
in the posted XML and ... Text ... for the second (or some other
content in place of ... from div/svg or div/small) causing
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
data-testid="menu-item" matches both outer li elements, while the text you are looking for is inside the inner strong element.
So, to locate the outer li element by its data-testid attribute value and the text of its inner strong element, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested both expressions, and they work correctly.
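To see why normalize-space() on the li itself does not equal "Text", it may help to compute the string value by hand. The sketch below uses Python's standard xml.etree.ElementTree and a hypothetical helper normalize_space that only approximates the XPath function; the words icon and hint are placeholders standing in for the elided "..." in the question, not real content:

```python
import xml.etree.ElementTree as ET

# Second <li> from the question. "icon" and "hint" are placeholder contents
# for the parts the question elides with "...".
li = ET.fromstring(
    '<li data-testid="menu-item" class="menu-item option">\n'
    '<svg>icon</svg>\n'
    '<div>\n'
    '<strong>Text</strong>\n'
    '<small>hint</small>\n'
    '</div>\n'
    '</li>'
)


def normalize_space(element):
    """Join all descendant text in document order and collapse whitespace runs."""
    return ' '.join(''.join(element.itertext()).split())


li_value = normalize_space(li)                          # text of svg, strong and small together
strong_value = normalize_space(li.find('div/strong'))   # only the strong text

print(repr(li_value))
print(repr(strong_value))
```

The comparison only succeeds when it is made against the strong element, which is what normalize-space(div/strong) = "Text" does.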

How to scrape the href using Nokogiri

I have a variable e which stores a Nokogiri::XML::Element object.
When I execute puts e, I get the following on the screen:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part: https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri?
If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.

XPath in RSelenium for indexing list of values

Here is an example of html:
<li class="index i1"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample.com">
<li class="index i2"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample2.com">
I would like to get the href value of every a element. The list items are distinguished by the class of the outer li, whose name changes: i1, i2.
So I have a counter and change it when I go to take the value.
i <- 1
stablestr <- "index "
myVal <- paste(stablestr , i, sep="")
But it fails even when I just try to access the li itself through the myVal index, using this:
profile<-remDr$findElement(using = 'xpath', "//*/input[@li = myVal]")
profile$highlightElement()
or the href using this
profile<-remDr$findElement(using = 'xpath', "/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']")
profile$highlightElement()
Is there anything wrong with xpath?
Your HTML structure is invalid. Your <li> tags are not closed properly, and it seems you are confusing <ol> with <li>. But for the sake of the question, I assume the structure is as you write, with properly closed <li> tags.
Then, constructing myVal is not right. It will yield "index 1" while you want "index i1". Use "index i" for stablestr.
Now for the XPath:
//*/input[@li = myVal]
This is obviously wrong since there is no input in your XML. Also, you didn't prefix the variable with $. And finally, the * seems to be unnecessary. Try this:
//li[@class = $myVal]
In your second XPath, there are also some errors:
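/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']
Three problems, from left to right: myVal is missing the $ prefix; div[@id='bare'] should test @class; and the class of the a element is actually 'tlt mhead', not 'tlt'.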
The first two issues are easy to fix. The third one is not. You could use contains(@class, 'tlt'), but that would also match if the class is, e.g., tltt, which is probably not what you want. Anyway, it might suffice for your use-case. Fixed XPath:
/li[@class=$myVal]/ol[@id='rem']/div[@class='bare']/h3/a[contains(@class, 'tlt')]
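A $myVal reference only resolves if the host environment binds XPath variables. Assuming the expression reaches the engine as plain text (which I believe is the case for Selenium locators, though that is an assumption here), the value has to be spliced into the string instead. The sketch below illustrates the idea in Python with the standard xml.etree.ElementTree rather than RSelenium; the body wrapper, the closed tags, and the helper href_for are all made up for the example:

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's markup (the original tags are not closed).
page = ET.fromstring(
    '<body>'
    '<li class="index i1"><ol id="rem"><div class="bare"><h3>'
    '<a class="tlt mhead" href="https://www.myexample.com"/></h3></div></ol></li>'
    '<li class="index i2"><ol id="rem"><div class="bare"><h3>'
    '<a class="tlt mhead" href="https://www.myexample2.com"/></h3></div></ol></li>'
    '</body>'
)


def href_for(index):
    """Return the href under the <li> whose class is 'index i<index>', or None."""
    my_val = f"index i{index}"
    # Splice the value into the expression text, quotes included.
    xpath = f".//li[@class='{my_val}']/ol[@id='rem']/div[@class='bare']/h3/a"
    for link in page.findall(xpath):
        # Compare whole class tokens, so a class like 'tltt' would not count as 'tlt'.
        if 'tlt' in link.get('class', '').split():
            return link.get('href')
    return None


hrefs = [href_for(i) for i in (1, 2)]
print(hrefs)
```

In R the same splice could be written with paste0() around the quoted value. Checking class tokens in code is one way around the contains() over-matching mentioned above.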

Xpath - how to extract data using class target

Code snippet:
<td class="right odds down"><a class=" betslip" target="unibet" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.00-0-0','24vekxv464x0x4g25d',5,event,1,1)" href="/bookmaker/unibet/betslip//event/1002752206/coupon/single,2133228960,p,[0]">1.70</a></td>
I am trying to extract data from a page where the class target is "Unibet".
What would be the correct formatting for this query?
I've tried:
//*[classtarget="unibet"]//td/a/@class
target is an attribute of the <a> element, not a class. The XPath to find the <td> element and then return its child <a> element whose target attribute equals "unibet" is:
//td/a[@target='unibet']
If you want to return the class attribute of the <a> element instead, append /@class to the XPath above:
//td/a[@target='unibet']/@class
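As a quick check of the attribute predicate, here is a small sketch in Python using the standard xml.etree.ElementTree. Its XPath subset has no /@class step, so the attribute is read with .get() after the element is found. The tr wrapper is an assumption, and the event-handler attributes and href from the question are left out to keep the markup short:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's cell, wrapped in a hypothetical <tr>.
row = ET.fromstring(
    '<tr><td class="right odds down">'
    '<a class=" betslip" target="unibet">1.70</a>'
    '</td></tr>'
)

# Same predicate as in the answer: match on the target attribute.
link = row.find(".//td/a[@target='unibet']")

link_class = link.get('class')   # stands in for the trailing /@class step
odds = link.text

print(repr(link_class), odds)
```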

Use Xpath To Retrieve Elements

HTML Portion:
<div class="abc">
<div style="text-align:left;" itemscope itemtype="xyz">
<h1 itemtype="mno"> I want this text </h1>
</div>
</div>
I am using
$text = $xpath->query('//div[class="abc"]/div/h1');
but I am getting no value. Please help me as I am new to it.
You should try
//div[@class="abc"]/div/h1
The difference is the @ sign before class, which selects along the attribute axis. When you omit the @ sign, the expression looks for node names (tag names), that is, for a child element named class.
This returns you the whole h1 node (or, rather, a node-set containing all the matching h1 nodes).
If you only wanted the text from the element, try the evaluate function instead:
$text = $xpath->evaluate("//div[@class='abc']/div/h1/text()");
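The effect of the missing @ can be reproduced outside PHP. The sketch below uses Python's standard xml.etree.ElementTree, whose XPath subset also treats a bare name in a predicate as a child element test. The body wrapper is an assumption, and the malformed style attribute from the question is repaired so the markup parses as XML:

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <body>.
doc = ET.fromstring(
    '<body><div class="abc">'
    '<div style="text-align:left;" itemtype="xyz">'
    '<h1 itemtype="mno"> I want this text </h1>'
    '</div></div></body>'
)

# Without @, the predicate looks for a child element named "class".
without_at = doc.findall(".//div[class='abc']/div/h1")

# With @, the predicate tests the class attribute.
with_at = doc.findall(".//div[@class='abc']/div/h1")
heading = with_at[0].text.strip()

print(len(without_at), len(with_at), heading)
```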
