XPath: extract attribute value plus inner text of child element - xpath

Is it possible to use an XPath expression to extract both the value of an attribute and the value of the inner text of a child element? For example, I want to obtain the value 6 and the six values 100, 200, 300, 400, 500, 600 from the following XML:
<Main>
<Example dim="6">
<Example-i>100,200,300,400,500,600</Example-i>
</Example>
</Main>
I can see how to get the value of the dim= attribute (e.g. //Example/@dim), and the inner text of the <Example-i> element (e.g., //Example-i/text()), but I cannot figure out a unified XPath expression to extract both.

You can use the fn:concat(...) function like this
concat(/Main/Example/@dim,' - ',/Main/Example/Example-i)
Or, in a more generalized way
concat(//Example/@dim,' - ',//Example/Example-i)
Its output, in both cases, is
6 - 100,200,300,400,500,600
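
When the XPath is evaluated from a host language, the two values can also be read separately and combined there. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which is an assumption on my part (the question does not name a tool). ElementTree supports only a subset of XPath and has no concat(), so the joining is done in Python.

```python
import xml.etree.ElementTree as ET

XML = """<Main>
<Example dim="6">
<Example-i>100,200,300,400,500,600</Example-i>
</Example>
</Main>"""

root = ET.fromstring(XML)          # root is the <Main> element
example = root.find("./Example")   # first <Example> child of <Main>

# Attribute value and child text are fetched separately.
dim_text = example.get("dim")
inner_text = example.findtext("Example-i")

# Typed values, if numbers are wanted rather than strings.
dim = int(dim_text)
values = [int(v) for v in inner_text.split(",")]

# Same shape as the concat(...) result above.
combined = f"{dim_text} - {inner_text}"

print(dim, values)
print(combined)
```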

Related

XPath for extracting image src with style attribute with the help of Screaming Frog

My goal is to extract image URLs together with style attribute (width and height) with the help of Screaming Frog.
<p style="text-align: center;"><img alt="Scary games are all about submerging into unknown territories" src="//cdn01.x-plarium.com/browser/content/blog/images/2022/scary-games-2.webp" style="width: 640px; height: 426px;"></p>
I am adding the following XPath for custom extraction - //img[contains(@style)]/@src
But getting errors for this.
Will really appreciate any help.
Your XPath below contains an error
//img[contains(@style)]/@src
The contains function expects two parameters, but you have passed only one (@style). Both parameters are strings; the function returns true if the second string is a substring of the first, and false otherwise.
If you just want to check that the style attribute has some value (any value) then the following will work:
//img[@style]/@src
If you want to check that the style attribute contains some particular string (e.g. 'width') then you want something like this:
//img[contains(@style, 'width')]/@src
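
To try the two predicates outside Screaming Frog, here is a rough sketch with Python's standard-library xml.etree.ElementTree. Several things are assumptions: the img tags are self-closed so the snippet parses as XML, the alt text is shortened, and the second img without a style attribute is a hypothetical addition for contrast. ElementTree understands [@style] but not contains(), so the substring check is done in Python.

```python
import xml.etree.ElementTree as ET

# The question's markup, self-closed so that it parses as XML.
# The second <img> (no style attribute) is hypothetical.
HTML = (
    '<p style="text-align: center;">'
    '<img alt="Scary games" '
    'src="//cdn01.x-plarium.com/browser/content/blog/images/2022/scary-games-2.webp" '
    'style="width: 640px; height: 426px;"/>'
    '<img alt="no style" src="/plain.png"/>'
    '</p>'
)

root = ET.fromstring(HTML)

# Equivalent of //img[@style]: images that have any style attribute.
with_style = [
    (img.get("src"), img.get("style"))
    for img in root.findall(".//img[@style]")
]

# Equivalent of //img[contains(@style, 'width')]/@src.
with_width = [src for src, style in with_style if "width" in style]

print(with_style)
print(with_width)
```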

Xpath expression (nokogiri) to get tag's child element?

From my xml, I can get this :
<home>
<creditors>
<count>2</count>
</creditors>
</home>
OR even this :
<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>
Which xpath expression can I use to get "<count>2</count>" instead of getting only "2" OR to get "<moreThan>2</moreThan>" instead of getting "2" ?
This XPath,
//creditors/count
will select all count child elements of all creditors elements in the XML document.
Update per OP's request in comments for a single XPath that selects both count and moreThan elements:
This XPath,
//creditors/*[self::count or self::moreThan]
will select all count or moreThan child elements of all creditors elements in the XML document.
Assuming that your xpath expression is OK, you just need to convert the element to string:
doc.xpath("home/creditors/*").to_s
=> "<count>2</count>"
Please check with queries returning more than one element, to make sure that it's desired behaviour.
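
The same idea (select the child element, then serialize it) can be sketched in Python with the standard-library xml.etree.ElementTree. This is a swapped-in tool, not Nokogiri. ElementTree has no self:: axis, so the "count or moreThan" test is done on the tag name in Python. The helper name creditor_children is hypothetical.

```python
import xml.etree.ElementTree as ET

FIRST = """<home>
<creditors>
<count>2</count>
</creditors>
</home>"""

SECOND = """<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>"""


def creditor_children(xml_text, wanted=("count", "moreThan")):
    """Return the serialized <count>/<moreThan> children of <creditors>."""
    root = ET.fromstring(xml_text)  # root is <home>
    return [
        # tostring() includes trailing whitespace after the element,
        # so strip it to get just the markup.
        ET.tostring(child, encoding="unicode").strip()
        for child in root.findall("./creditors/*")
        if child.tag in wanted
    ]


print(creditor_children(FIRST))
print(creditor_children(SECOND))
```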

XPath - Nested path scraping

I'm trying to scrape HTML from a webpage. I'd like to fetch the three alternate texts (alt - highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Note the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector so that it references a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child <a> elements whose class attribute value equals 'cover-wrapper'
/img/@alt : then from each such <a> element, find the child <img> element and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
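
The path steps described above can be exercised on a small stand-in document. The page's real HTML is not included here, so the markup below is hypothetical: it only assumes the div/a/img nesting and the id and class values that the XPath implies. The sketch uses Python's standard-library xml.etree.ElementTree rather than lxml. ElementTree cannot return attributes from a path or call starts-with(), so both are done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; only the nesting and the
# id/class values are taken from the XPath in the answer.
SAMPLE = """<div class="showcase">
  <div class="showcase-wrapper" id="slide-1">
    <a class="cover-wrapper" href="#"><img alt="First" src="1.jpg"/></a>
  </div>
  <div class="showcase-wrapper" id="slide-2">
    <a class="cover-wrapper" href="#"><img alt="Second" src="2.jpg"/></a>
  </div>
  <div class="showcase-wrapper" id="slide-3">
    <a class="cover-wrapper" href="#"><img alt="Third" src="3.jpg"/></a>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)

# //div[@id='slide-1']/a[@class='cover-wrapper']/img, then read @alt.
first_alt = [
    img.get("alt")
    for img in root.findall(".//div[@id='slide-1']/a[@class='cover-wrapper']/img")
]

# starts-with(@id, 'slide-') expressed as a Python filter.
all_alts = [
    img.get("alt")
    for div in root.iter("div")
    if div.get("id", "").startswith("slide-")
    for img in div.findall("./a[@class='cover-wrapper']/img")
]

print(first_alt)
print(all_alts)
```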
Try this:
//a[@class="cover-wrapper"]/img/@alt
Here I first select the a element whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element :
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

xpath expression to select attribute value

Is there an xpath way to select a given attribute value?
For example I have an html document and want to select only "?ms=669601" :
<input type="button" value="تفاصيل" onclick="xmlreqGET("?ms=669601","jm1x");">
In your simple example, you could simply select that portion of the onclick attribute in the only input:
substring(input/@onclick, 12, 10)
In more complicated documents, try selecting first by @value (or some other (possibly unique) criteria):
substring(//input[@value='تفاصيل']/@onclick, 12, 10)
Or by targeting the input that contains part of the desired substring:
substring(//input[contains(@onclick, 'xmlreqGET(')]/@onclick, 12, 10)
Selecting the input element itself if its onclick attribute contains the target string:
//input[contains(@onclick, '?ms=669601')]
Note: Your input is not valid XML, due to nested double-quotes.
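
As a cross-check of the offsets, here is a small sketch in Python with the standard-library xml.etree.ElementTree. Two assumptions: the onclick attribute is wrapped in single quotes so the snippet is well-formed, and a hypothetical form wrapper is added as the root. XPath's substring(s, 12, 10) counts from 1, so it corresponds to the Python slice s[11:21]. The regular expression is an alternative I am adding that does not depend on fixed positions.

```python
import re
import xml.etree.ElementTree as ET

# onclick is single-quoted here so the nested double quotes are legal XML;
# the <form> wrapper is hypothetical.
SNIPPET = (
    '<form><input type="button" value="تفاصيل" '
    "onclick='xmlreqGET(\"?ms=669601\",\"jm1x\");'/></form>"
)

root = ET.fromstring(SNIPPET)

# Equivalent of //input[@value='تفاصيل'], then read @onclick.
button = root.find(".//input[@value='تفاصيل']")
onclick = button.get("onclick")

# substring(@onclick, 12, 10): 1-based start 12, length 10.
fixed = onclick[11:21]

# Position-independent alternative: first quoted argument of xmlreqGET.
match = re.search(r'xmlreqGET\("([^"]*)"', onclick)
by_regex = match.group(1) if match else None

print(fixed)
print(by_regex)
```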

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate where you put in the attribute name, whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
You can use CSS selectors:
elements = @doc.css "[blah]"
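
The attribute-existence predicate is not specific to Nokogiri. As a sketch, the same query run with Python's standard-library xml.etree.ElementTree (a swapped-in tool, used here only to illustrate the XPath) finds the same two nodes:

```python
import xml.etree.ElementTree as ET

DOC = """<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>"""

root = ET.fromstring(DOC)

# .//*[@blah] : any descendant element that has a "blah" attribute,
# whatever its value.
tagged = root.findall(".//*[@blah]")

names = [el.tag for el in tagged]
blah_values = [el.get("blah") for el in tagged]

print(names)
print(blah_values)
```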
