XPath - Nested path scraping - xpath

I'm trying to scrape HTML from a webpage. I'd like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:

This is one possible XPath:
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation:
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector: it would then reference a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : From each <div> element found by the previous step, find the child <a> elements whose class attribute equals 'cover-wrapper'.
/img/@alt : Then, from each of those <a> elements, find the child <img> element and return its alt attribute.
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
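The two expressions above can be tried with lxml in a minimal sketch like the following. The markup in SAMPLE is an assumption made up for illustration (the HTML from the question is not available), as are the alt values and file names.

```python
from lxml import html

# Hypothetical markup standing in for the page in the question.
SAMPLE = """
<html><body>
<div class="showcase-wrapper" id="slide-1">
  <a class="cover-wrapper" href="#"><img src="1.png" alt="First slide"></a>
</div>
<div class="showcase-wrapper" id="slide-2">
  <a class="cover-wrapper" href="#"><img src="2.png" alt="Second slide"></a>
</div>
<div class="showcase-wrapper" id="slide-3">
  <a class="cover-wrapper" href="#"><img src="3.png" alt="Third slide"></a>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Exact id match: only the alt text of the image inside slide-1.
one_alt = tree.xpath("//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt")

# Prefix match on the id: the alt text of every slide.
all_alts = tree.xpath(
    "//div[starts-with(@id, 'slide-')]/a[@class='cover-wrapper']/img/@alt"
)

print(one_alt)
print(all_alts)
```

Selecting an attribute with /@alt returns a list of strings, so no /text() step is needed.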

Try this:
//a[@class="cover-wrapper"]/img/@alt
Here I first select the a element whose class is cover-wrapper, then its child img element, and finally the alt attribute of that img.
To find the whole image element:
//a[@class="cover-wrapper"]/img

I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

Related

XPath selecting specific child element

I have a problem with XPath syntax in HTML. I want to select an item that is inside a div.
I have a div defined by an id: "popin".
In this div, I have a span whose id is "id_yes".
I can get the div with //DIV[contains(@id ,'popin')] but I failed to get the span element.
Do you have a solution?
If you have the ID, you can use:
//span[@id="id_yes"]
If you want to be more specific, use //div[@id="popin"]/span[@id="id_yes"]
That is assuming your IDs are unique.
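As a rough illustration, both expressions can be checked with lxml. The page content below is invented for the example; only the two ids come from the question.

```python
from lxml import html

# Hypothetical page; only the ids "popin" and "id_yes" come from the question.
PAGE = """
<html><body>
  <div id="popin">
    <span id="id_yes">Yes</span>
    <span id="id_no">No</span>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Look the span up by its id alone.
by_id = tree.xpath('//span[@id="id_yes"]')

# Require the span to be a direct child of the "popin" div.
scoped = tree.xpath('//div[@id="popin"]/span[@id="id_yes"]')

by_id_text = [el.text for el in by_id]
scoped_text = [el.text for el in scoped]
print(by_id_text, scoped_text)
```

Note that the scoped form uses /, which matches direct children only; if the span were nested deeper inside the div, //div[@id="popin"]//span[@id="id_yes"] would be needed.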

Retrieving a parent tag with a given attribute that contains a subelement by using XPath

How can I retrieve multiple DIVs (with a given class attribute "a") that contain a span tag with a class attribute "b" by using XPath?
<div class='a'>
<span class='b'/>
</div>
The structure of my XML is not defined so basically the span could be at any level of the div and the div itself could be at any level of the XML tree.
This should work:
//div[@class='a'][span/@class='b']
// means search anywhere if it starts the expression.
If the span is deeper in the div, use descendant:: which can be shortened to // again:
//div[@class='a'][.//span/@class='b']
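The difference between the two predicates can be seen in a small lxml sketch. The XML document and the id attributes are assumptions added so the matched divs can be told apart.

```python
from lxml import etree

# Hypothetical document; the id attributes exist only to label the results.
DOC = """
<root>
  <div class="a" id="direct"><span class="b"/></div>
  <section>
    <div class="a" id="nested"><p><span class="b"/></p></div>
  </section>
  <div class="a" id="no-span"/>
  <div class="c" id="wrong-class"><span class="b"/></div>
</root>
"""

root = etree.fromstring(DOC)

# span must be a direct child of the div.
direct = root.xpath("//div[@class='a'][span/@class='b']")

# span may sit at any depth below the div.
anywhere = root.xpath("//div[@class='a'][.//span/@class='b']")

direct_ids = [d.get("id") for d in direct]
anywhere_ids = [d.get("id") for d in anywhere]
print(direct_ids)
print(anywhere_ids)
```

The leading dot in .//span matters: without it, //span inside the predicate would search the whole document rather than the current div.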

Xpath expression returns null

I have plenty of links like this:
<b>Edit issue >></b>
To extract the content of the href, I use this XPath expression:
//a[contains(@href,'/edit_flat')]
but it returns null. What am I doing wrong?
//a[contains(@href,'/edit_flat')] selects a elements anywhere in the document tree that have an href attribute containing the '/edit_flat' string.
These matching elements do have an href attribute, but the XPath expression you are using returns only the a elements themselves, if there are any.
To return the attribute values of the matching elements, you need an extra step, with / and @href. So what you want is:
//a[contains(@href,'/edit_flat')]/@href
Suggestion:
What you really want is probably to select links whose href begins with the substring "/edit_flat", so it's safer to use:
.//a[starts-with(@href,'/edit_flat')]/@href
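A short lxml sketch of the three expressions discussed above. The links and their href values are hypothetical, since the original markup is not shown in full.

```python
from lxml import html

# Hypothetical links; the href values are made up for illustration.
PAGE = """
<html><body>
  <a href="/edit_flat/1"><b>Edit issue &gt;&gt;</b></a>
  <a href="/view/edit_flat/2"><b>View</b></a>
  <a href="/other"><b>Other</b></a>
</body></html>
"""

tree = html.fromstring(PAGE)

# Returns the matching <a> elements, not their href values.
elements = tree.xpath("//a[contains(@href, '/edit_flat')]")

# The extra /@href step returns the attribute values as strings.
hrefs = tree.xpath("//a[contains(@href, '/edit_flat')]/@href")

# starts-with() is stricter: the substring must be at the beginning.
prefixed = tree.xpath(".//a[starts-with(@href, '/edit_flat')]/@href")

print(len(elements), hrefs, prefixed)
```

Here contains() also matches "/view/edit_flat/2", which is the kind of accidental match the starts-with() suggestion avoids.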

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It asks for all elements that have attributes that have child elements named 'blah'. Since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal; they are there to indicate where you put the attribute name. The outer square brackets, on the other hand, are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
You can also use CSS selectors:
elements = @doc.css "[blah]"
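The XPath itself is not specific to Nokogiri. As a sketch in Python with lxml (a swapped-in library, not the one used in the question), the working and non-working expressions behave as described; the markup mirrors the example document, wrapped in html and body tags for parsing.

```python
from lxml import html

# Same elements as the example in the question, parsed with lxml instead of Nokogiri.
FRAGMENT = """
<html><body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body></html>
"""

tree = html.fromstring(FRAGMENT)

# Every element, at any depth, that carries a "blah" attribute.
with_attr = tree.xpath("//*[@blah]")

# The expression from the snippet: attributes with a child element named blah.
broken = tree.xpath("//*[@*[blah]]")

tags = [el.tag for el in with_attr]
print(tags)
print(broken)
```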

XPath query to select all href attributes of <a> tag, which 'class' attribute equals specified string

I don't know why the following query doesn't work:
//a/@href[@class='specified_string']
Try it the other way round:
//a[@class='specified_string']/@href
After all, class is an attribute of the <a> element, not an attribute of the href attribute.
An attribute cannot have attributes. Only elements can have attributes.
The original XPath expression:
//a/@href[@class='specified_string']
selects any href attribute of any a element, such that the href attribute has an attribute class whose value is 'specified_string'.
What you want is:
//a[@class='specified_string']/@href
that is: the href attribute of any a element that has a class attribute with value 'specified_string'.
You are basically saying that you are looking for an attribute named href, whose attribute (this is the error) class should be equal to specified_string.
But you need to find the attribute href of an element a, whose attribute class is specified_string.
(ndim's answer overlapped mine)
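To see the two predicate placements side by side, here is a minimal lxml sketch; the anchors and their href values are invented for the example.

```python
from lxml import html

# Hypothetical anchors; only the class name comes from the question.
PAGE = """
<html><body>
  <a class="specified_string" href="/first">First</a>
  <a class="other" href="/second">Second</a>
  <a href="/third">Third</a>
</body></html>
"""

tree = html.fromstring(PAGE)

# Predicate applied to the href attribute node: attributes have no attributes.
wrong = tree.xpath("//a/@href[@class='specified_string']")

# Predicate applied to the <a> element, then step to its href attribute.
right = tree.xpath("//a[@class='specified_string']/@href")

print(wrong)
print(right)
```

The first expression is syntactically valid, which is why it fails silently with an empty result rather than an error.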
There is no class attribute present in my anchor tag; I have href only. It is identified using //*[@href='value'], but //*a[@href='value'] is not working.
