Is there an xpath way to select a given attribute value?
For example I have an html document and want to select only "?ms=669601" :
<input type="button" value="تفاصيل" onclick="xmlreqGET("?ms=669601","jm1x");">
In your simple example, you could simply select that portion of the onclick attribute in the only input:
substring(input/@onclick, 12, 10)
In more complicated documents, try selecting first by @value (or some other, possibly unique, criterion):
substring(//input[@value='تفاصيل']/@onclick, 12, 10)
Or by targeting the input that contains part of the desired substring:
substring(//input[contains(@onclick, 'xmlreqGET(')]/@onclick, 12, 10)
Selecting the input element itself if its onclick attribute contains the target string:
//input[contains(@onclick, '?ms=669601')]
Note: Your input is not valid XML, due to nested double-quotes.
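As a rough sketch, the substring() approach can be tried out with lxml (the library used further down this page). The markup here is an assumption: the onclick value is wrapped in single quotes so that it parses, and the surrounding form element is only there for illustration.

```python
from lxml import html

# Assumption: single quotes around the onclick value, because the markup in the
# question nests double quotes and does not parse as intended.
snippet = (
    '<form><input type="button" value="تفاصيل" '
    "onclick='xmlreqGET(\"?ms=669601\",\"jm1x\");'></form>"
)
doc = html.fromstring(snippet)

# XPath string positions are 1-based: character 12 is the first one after xmlreqGET("
ms_param = doc.xpath("substring(//input[@value='تفاصيل']/@onclick, 12, 10)")
print(ms_param)
```

The fixed offsets only hold while the function name and the parameter length stay the same; if they vary, substring-before() and substring-after() are probably more robust.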
I have an XML of the form:
<articleslist>
<articles>
<originalId>507948</originalId>
<title>Hogan Lovells Training Contract</title>
<slug>hogan-lovells-training-contract</slug>
<metaTitle>Hogan Lovells Training Contract</metaTitle>
<metaDescription>Find out about the Hogan Lovells Training Contract and Application Process</metaDescription>
<language>en</language>
<disableAds>false</disableAds>
<shortUrl>false</shortUrl>
<category_slug>law</category_slug>
<subcategory_slug>industry</subcategory_slug>
<updatedAt>2021-03-15T18:38:51.058+00:00</updatedAt>
<createdAt>2018-11-29T06:42:51.665+00:00</createdAt>
</articles>
</articleslist>
I'm able to select the row values with the XPATH //articles.
How can I select the child properties of articles (i.e. the column headings), so I get back a list of the form:
originalId
title
slug
etc...
Depends on your XPath version.
In XPath 2.0 it's simply //articles/*/name()
In 1.0 it's not possible because there's no such data type as a "sequence of strings". You would have to return the set of elements as //articles/*, and then extract their names in the calling program.
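A minimal sketch of the XPath 1.0 route, assuming Python's standard-library ElementTree as the calling program and a shortened copy of the sample document: select the child elements, then read their names in the host language.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (only a few of the child elements).
xml_text = """<articleslist>
  <articles>
    <originalId>507948</originalId>
    <title>Hogan Lovells Training Contract</title>
    <slug>hogan-lovells-training-contract</slug>
    <language>en</language>
  </articles>
</articleslist>"""

root = ET.fromstring(xml_text)

# The path returns elements; the names are extracted afterwards in Python.
names = [child.tag for child in root.findall(".//articles/*")]
print(names)
```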
Given the markup below:
<span id="cTDQo7-img" class="z-menu-img"></span> payment
<span id="cTDQo7-img" class="z-menu-img"></span>
"payment"
I would like to build a locator using contains, but the word "payment" appears
many times on the page, for example payment1, payment2, payment3.
The id is not unique either.
I tried the expressions below, but none of them work for me.
//a[contains(.,'payment')]
//span[@class='z-menu-img'] [contains(.,'payment')]
//span[@class='z-menu-img'] and [contains(.,'payment')]
//span[@class='z-menu-img'] contains(.,'payment')
Option 1 : Use the other attributes in combination with text
//a[@class='z-menu-cnt z-menu-cnt-img' and normalize-space(.)='payment']
Option 2: Specify the position if you have multiple elements without unique attributes/path
(//a[contains(.,'payment')])[1]
The second XPath identifies the first occurrence of a link containing the text 'payment'. You can change the tag name and the index to suit your page.
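The difference between the two options can be sketched with lxml. The menu markup below is hypothetical (the question does not show the enclosing a elements), so treat the class names and ordering as assumptions.

```python
from lxml import html

# Hypothetical menu: three links whose text all contains "payment".
menu = html.fromstring(
    "<ul>"
    '<li><a class="z-menu-cnt z-menu-cnt-img"><span class="z-menu-img"></span> payment1</a></li>'
    '<li><a class="z-menu-cnt z-menu-cnt-img"><span class="z-menu-img"></span> payment</a></li>'
    '<li><a class="z-menu-cnt z-menu-cnt-img"><span class="z-menu-img"></span> payment2</a></li>'
    "</ul>"
)

# Option 1: exact match on the whitespace-normalized text.
exact = menu.xpath(
    "//a[@class='z-menu-cnt z-menu-cnt-img' and normalize-space(.)='payment']"
)
# Option 2: first match in document order among all links containing the word.
first = menu.xpath("(//a[contains(.,'payment')])[1]")

exact_texts = [a.text_content().strip() for a in exact]
first_texts = [a.text_content().strip() for a in first]
print(exact_texts, first_texts)
```

With this ordering, the positional locator picks up payment1, which is why the exact-text variant is usually the safer choice when the order of the links can change.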
I'm trying to use Xpath in order to select an HTML tag based on its value
Here is my html code:
<span class="yellowbird">Continue</span>
<span class="yellowbird">Stop</span>
I can select the span elements with a specific class value using
//span[contains(@class, 'yellowbird')]
However I'm struggling to select only the element which contains the value "Continue"
This XPath expression will select any span element whose class attribute equals yellowbird and text equals Continue:
//span[@class='yellowbird' and text()='Continue']
Here is the syntax I used to make this work using request.xpath and scrapy
//span[contains(@class, 'yellowbird')][1]//text()='Continue'
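The two expressions above behave differently, which a small lxml sketch makes visible. The wrapping div is an assumption added so that the fragment has a single root.

```python
from lxml import html

doc = html.fromstring(
    '<div><span class="yellowbird">Continue</span>'
    '<span class="yellowbird">Stop</span></div>'
)

# Predicate form: returns the matching span elements themselves.
spans = doc.xpath("//span[@class='yellowbird' and text()='Continue']")
span_texts = [s.text for s in spans]

# Comparison form: the whole expression is a comparison, so it yields a boolean.
is_continue = doc.xpath(
    "//span[contains(@class, 'yellowbird')][1]//text()='Continue'"
)
print(span_texts, is_continue)
```

If the goal is to get hold of the element, the predicate form is the one to use; the comparison form only answers whether such text exists.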
I'm trying to scrape a web page. I would like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector: it then refers to a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child <a> element whose class attribute equals 'cover-wrapper'
/img/@alt : then, from each such <a> element, find the child <img> element and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
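Since the page's HTML is not reproduced here, the following lxml sketch runs both variants against a hypothetical structure inferred from the answers (div with an id, a.cover-wrapper, img with alt). The alt values and file names are made up.

```python
from lxml import html

# Hypothetical markup: the real page is not shown, so this structure is assumed.
page = html.fromstring(
    '<div id="showcase">'
    '<div class="showcase-wrapper" id="slide-1">'
    '<a class="cover-wrapper" href="#"><img src="1.png" alt="First cover"></a></div>'
    '<div class="showcase-wrapper" id="slide-2">'
    '<a class="cover-wrapper" href="#"><img src="2.png" alt="Second cover"></a></div>'
    '<div class="showcase-wrapper" id="slide-3">'
    '<a class="cover-wrapper" href="#"><img src="3.png" alt="Third cover"></a></div>'
    "</div>"
)

# One slide: attributes are referenced with @, and /@alt returns the values.
one_alt = page.xpath("//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt")

# All slides: relax the id filter with starts-with().
all_alts = page.xpath(
    "//div[starts-with(@id,'slide-')]/a[@class='cover-wrapper']/img/@alt"
)
print(one_alt, all_alts)
```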
Try this:
//a[@class="cover-wrapper"]/img/@alt
So, I first select the a element whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element :
//a[@class="cover-wrapper"]/img
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
Is there any way to specify that I want to select only tag-less child elements (in the following example - "text")?
<div>
<p>...</p>
"text"
</div>
The text() node test matches text nodes. Example: //div/text() matches all text nodes that are children of any div element.
Use:
/*/text()[normalize-space()]
This selects all text nodes that are children of the top element of the document and that do not consist only of white-space characters.
In the concrete example this will select only the text node with string value:
'
"text"
'
The XPath expressions:
/*/text()
or
/div/text()
both select two text nodes, the first of which contains only white-space and the second is the same text node as above:
'
"text"
'
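A short lxml sketch of the difference, assuming the div from the question is the document element and is written exactly with those line breaks:

```python
from lxml import etree

# The div from the question, with its line breaks preserved.
root = etree.fromstring('<div>\n<p>...</p>\n"text"\n</div>')

# Every text node child of the top element, including the whitespace-only one.
all_text = root.xpath("/*/text()")

# Only text nodes that contain something other than whitespace.
non_blank = root.xpath("/*/text()[normalize-space()]")

print(repr(all_text), repr(non_blank))
```

The surrounding line breaks stay part of the selected node; calling strip() on the result in Python is one way to drop them.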
select only tag-less child elements
To me this sounds like selecting all elements that don't have other elements as children. But then again, "text" in your example is not an element but a text node, so I'm not really sure what you want to select...
Anyway, here is a solution for selecting such elements.
//*[not(*)]
Selects all elements that don't have an element as a child. Replace the first * with an element name if you only want to select certain elements that don't have child elements. Also note that using // is generally slow since it runs through the whole document. Consider using a more specific path when possible (like /div/*[not(*)] in this case).
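To illustrate, here is a small lxml sketch; the nested sample document is made up for the purpose and is not the one from the question.

```python
from lxml import etree

# Made-up document: p, em and br have no element children; div and section do.
root = etree.fromstring(
    "<div><p>...</p><section><em>leaf</em></section><br/></div>"
)

# All childless elements anywhere in the document, in document order.
leaf_tags = [el.tag for el in root.xpath("//*[not(*)]")]

# Narrower path: only childless elements directly under the top-level div.
direct_leaf_tags = [el.tag for el in root.xpath("/div/*[not(*)]")]

print(leaf_tags, direct_leaf_tags)
```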