XPath Getting child elements from html - xpath

I am trying to find the xpath for only the child of a navigation bar. The path which I am trying at the moment is //div[#class='navCol subMenus'] from this peace of HTML.
<div class="PrimaryNavigationContainer">
<div class="PrimaryNavigation">
<div class="Menu">
<div>
<span>Brands</span>
<div class="navCol">
<div>
<a class="NoLink unselectable"><span>Shop by Brand</span></a>
<div class="navCol subMenus">
<div>
<span>blah</span>
I have tried a number of Xpath syntax but none seem to work to bring up just the sub categories. Thank you for any help which you can provide.

Related

Telegram instant view

I need a help to solve a problem that I am facing in Instant viewr. The problem is removing a <div> from my instant view.
Tried to delete through id and class, but it does not work:
#remove:$body//div[#id="navbar-complex"]
#remove: //div[#class="navbar-complex"]
The Html of div that I want to remove (link to the page):
<div id="navbar-complex" class="scrollmenu tab-content nav nav-tabs">
<i class="bi bi-building"></i> О комплексе
The same problem with following div:
<div class="csection-item-wrap">
<div class="csection-item rco1 stat3">

Parsing through response created with XPath

Using Scrapy, I want to extract some data from a HTML well-formed site. With XPath I am able to extract a list of items, but I am not able to extra data from the elements in the list, using XPath
All XPath's have been tested using XPather. I have tested the issue using a local file that contains the webpage, same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[#id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following Xpath to extract "The text I want", i dont get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(#class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
Try the following code.
XPATH_PRODUCT_NAME = ".//div[#class='item__name']/a[#class='item__name-link']/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()

Xpath not just getting parent of html

I am trying to find the xpath for only the parent of a navigation bar. The path which I am trying at the moment is `//a[#class='unselectable'] from this peace of HTML.
`<div class="PrimaryNavigationContainer">
<div class="PrimaryNavigation">
<div class="Menu">
<div>
<a href="http://www.blah.co.uk/brands.aspx" class="unselectable"><span>
Brands</span></a>
<div class="navCol">
<div>
<a class="NoLink unselectable"><span>Shop by Brand</span></a>
<div class="navCol subMenus">
div>
<a href="http://www.blah.co.uk/blah/catlist_bd4.htm" class="unselectable"><span>
blah</span></a>
The xpath seem to be bringing up both the top level cats and sub categories and I am because it is in both but not sure how to single of the parent from the chld. Thanks for any help which you can provide
How about //div[#class="Menu"]/div/a[#class='unselectable']? This way you avoid selecting the a in the subMenus div.

using variables in HtmlXPathSelectors

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. Meaning if I have some long string it would be nice to store it in a variable long_string and then select for it: hxs.select('\\div[#id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out as it wouldn't make sense for you to always have to hard-code the string in.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[#id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[#id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[#id=concat(long_string,"er")]')
hxs.select('//div[#id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.
Assuming + works for string concatenation in Scrapy, this should work:
hxs.select('//div[#id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
have you tried?
hxs.select('\\div[#id="' + long_string_variable + '"]')

xPath strange behaviour - selecting ALL elements even if [1] set

today I stumbled upon a very interesting case (at least for me). I am messing around with Selenium and xPath and tried to get some elements, but got a strange behaviour:
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some other text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some even unrelated text
</a>
</div>
</div>
</div>
This is my data.
When i run the following xPath query:
//div[#class="title"][1]/a
I get as a result ALL instead of only the first one. But if I query:
//div[#class="resultcontainer"][1]/div[#class="info"]/div[#class="title"]/a
I get only the first , not all.
Is there some divine reason behind that?
Best regards,
bisko
I think you want
(//div[#class="title"])[1]/a
This:
//div[#class="title"][1]/a
selects all (<a> elements that are children of) <div> elements that have a #class of 'title', that are the first children of their parents (in this context). Which means: it selects all of them.
The working XPath selects all <div> elements that have a #class of 'title' - and of those it takes the first one.
The predicates (the expressions in square brackets []) are applied to each element that matched the preceding location step (i.e. "//div") individually. To apply a predicate to a filtered set of nodes, you need to make the grouping clear with parentheses.
Consequently, this:
//div[1][#class="title"]/a
would select all <div> elements, take the first one, and then filter it down futher by checking the #class value. Also not what you want. ;-)

Resources