How to select this element with Scrapy XPATH?

Only requirement: it needs to refer to the thread-navigation class, because that page has many other pagination elements
<section id="thread-navigation" class="group">
<div class="float-left">
<div class="pagination talign-mleft">
<span class="pages">Pages (6):</span>
<span class="pagination_current">1</span>
2
3
4
5
6
Next » //<--- this one
</div>
</div>
</section>
I was trying something like this:
r.xpath('//*[@class="thread-navigation" and contains (., "Next")]').get()
But it always returns None
Thank you

thread-navigation is the value of the element's @id attribute, not of its @class attribute, so the @class test never matches. Try this XPath 1.0 expression:
r.xpath('//a[ancestor::*/@id="thread-navigation" and contains(text(), "Next")]/@href').get()
Its result is
I want this text?page=2

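This XPath also targets the section by @id, but it matches the href of every link inside it, not only the Next link:
'//section[@id="thread-navigation"]//a/@href'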
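
A minimal sketch of both answers, run against a reduced copy of the markup. It uses lxml (the parser Scrapy's selectors are built on) so that it runs standalone; the same expression strings can be passed to r.xpath(...) in Scrapy. The <a> tags and href values are assumptions, since the links were not preserved in the snippet above, and the extra pagination block is added only to show that it is ignored.

```python
from lxml import html

# Hypothetical markup: the <a> tags and href values are assumed for illustration.
PAGE = """
<html><body>
<div class="pagination"><a href="other?page=9">Next »</a></div>
<section id="thread-navigation" class="group">
  <div class="float-left">
    <div class="pagination talign-mleft">
      <span class="pages">Pages (6):</span>
      <span class="pagination_current">1</span>
      <a href="thread?page=2">2</a>
      <a href="thread?page=3">3</a>
      <a href="thread?page=2">Next »</a>
    </div>
  </div>
</section>
</body></html>
"""

tree = html.fromstring(PAGE)

# Only the "Next" link inside the section whose id is thread-navigation.
next_href = tree.xpath(
    '//section[@id="thread-navigation"]//a[contains(text(), "Next")]/@href'
)

# The same link, found through the ancestor axis as in the first answer.
ancestor_href = tree.xpath(
    '//a[ancestor::*/@id="thread-navigation" and contains(text(), "Next")]/@href'
)

# Every link inside the section, in document order.
all_hrefs = tree.xpath('//section[@id="thread-navigation"]//a/@href')

print(next_href)
print(ancestor_href)
print(all_hrefs)
```

In Scrapy, calling .get() on the first two expressions would return the single href, while .getall() on the last one would return the whole list.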

Related

Parsing through response created with XPath

Using Scrapy, I want to extract some data from a well-formed HTML site. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in the list using XPath.
All XPaths have been tested using XPather. I have tested the issue using a local file that contains the webpage, with the same result.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When I use the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?

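An XPath that starts with / is evaluated from the document root, even when it is called on products[0], so it matches nothing. Start the expression with . to make it relative to the selected element:
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"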
products[0].xpath(XPATH_PRODUCT_NAME).extract()
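
The difference between the absolute and the relative expression can be checked with a small sketch. It uses lxml directly rather than a Scrapy response, and the outer wrapper with id product-list-146620 is reconstructed as an assumption, because the question only shows the inner items.

```python
from lxml import html

# Reduced copy of the product markup; the outer wrapper divs are assumed.
PAGE = """
<html><body>
<div id="product-list-146620"><div>
  <div data-pageindex="1" data-guid="13157582" class="col ">
    <div class="item item-card item-card--static">
      <div class="item__text-container">
        <div class="item__name">
          <a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
        </div>
      </div>
    </div>
  </div>
</div></div>
</body></html>
"""

tree = html.fromstring(PAGE)
products = tree.xpath("//*[@id='product-list-146620']/div/div")

# A leading "/" restarts at the document root (<html>), so nothing matches.
absolute = products[0].xpath(
    "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
)

# A leading "." keeps the search inside the selected product element.
relative = products[0].xpath(
    ".//div[@class='item__name']/a[@class='item__name-link']/text()"
)

print(absolute)
print(relative)
```

The same rule applies to // on a sub-selection: without the leading dot it searches the whole document again.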

Watir: How to retrieve all HTML elements that match an attribute? (class, id, title, etc)

I have a page that is dynamically created and displays a list of products with their prices. Since it's dynamic, the same code is reused to create each product's information, so they share the tags and same classes. For instance:
<div class="product">
<div class="name">Product A</div>
<div class="details">
<span class="description">Description A goes here...</span>
<span class="price">$ 180.00</span>
</div>
</div>
<div class="product">
<div class="name">Product B</div>
<div class="details">
<span class="description">Description B goes here...</span>
<span class="price">$ 43.50</span>
</div>
</div>
<div class="product">
<div class="name">Product C</div>
<div class="details">
<span class="description">Description C goes here...</span>
<span class="price">$ 51.85</span>
</div>
</div>
And so on.
What I need to do with Watir is retrieve all the texts inside the spans with class="price", in this example: $ 180.00, $ 43.50 and $ 51.85.
I've been playing around with something like this:
@browser.span(:class, 'price').each do |row| but it is not working.
I'm just starting to use loops in Watir. Your help is appreciated. Thank you!

You can use the pluralized methods to retrieve collections, so use spans instead of span:
@browser.spans(:class => "price")
This returns a span collection that behaves much like a Ruby array, so you can use Ruby's #each as you tried, but I would use #map in this situation:
texts = @browser.spans(:class => "price").map do |span|
  span.text
end
puts texts
I would use the Symbol#to_proc trick to shorten that code even more:
texts = @browser.spans(:class => "price").map(&:text)
puts texts

using variables in HtmlXPathSelectors

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. For example, if I have some long string, it would be nice to store it in a variable long_string and then select on it: hxs.select('//div[@id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out as it wouldn't make sense for you to always have to hard-code the string in.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[@id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[@id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[@id=concat(long_string,"er")]')
hxs.select('//div[@id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.

The XPath expression is an ordinary Python string, so you can build it with + before passing it to Scrapy:
hxs.select('//div[@id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.

Have you tried this?
hxs.select('//div[@id="' + long_string_variable + '"]')
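
A short sketch of why the quoted name fails and two ways around it, using lxml directly. Inside the expression, long_string is read by XPath as an element name, never as the Python variable. lxml's xpath() accepts keyword arguments as XPath variables written as $name; current Scrapy selectors accept the same keyword form in response.xpath, but whether the legacy hxs.select API from the question does is not confirmed here.

```python
from lxml import html

# Reduced copy of the footer markup from the question.
PAGE = """
<html><body>
<div id="footer">
  <div id="footer-menu"><ul class="menu"><li>FAQs</li></ul></div>
</div>
</body></html>
"""

tree = html.fromstring(PAGE)
long_string = "foot"

# Inside the quotes, long_string is an XPath child-element name,
# not the Python variable, so nothing matches.
literal = tree.xpath('//div[@id=long_string]')

# Option 1: build the expression with Python string concatenation.
built = tree.xpath('//div[@id="' + long_string + 'er"]')

# Option 2: pass the value as an XPath variable ($prefix).
bound = tree.xpath('//div[@id=concat($prefix, "er")]', prefix=long_string)

print(len(literal))
print([div.get("id") for div in built])
print([div.get("id") for div in bound])
```

Passing the value as a variable avoids quoting problems when the string itself contains quote characters.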

Xpath: locate a node by multiple attributes of a parent node

Here is the code:
<li class="abc">
<div class="abc">
<input type="checkbox">
</div>
<div class="xyz">
<div class="headline">Mongo like candy</div>
</div>
</li>
<li class="abc">
<div class="abc">
<input type="checkbox">
</div>
<div class="xyz">
<div class="headline">Candygram for mongo</div>
</div>
</li>
XPath challenge: I want to locate the checkbox of the li that contains the headline "Mongo like candy" so I can select it using Selenium. In other words, how do you locate the checkbox from here:
li//div[@class='abc']//input[@type='checkbox']
but qualifying it with a different attribute within the same parent node:
li//div[@headline][contains(text(),"Mongo like candy")]

The basic idea is to qualify the final path with a predicate, i.e.
li[/*predicate here*/]//div[@class='abc']//input[@type='checkbox']
The predicate expresses the condition on the li that you want:
.//div[@class='headline' and contains(text(), "Mongo like candy")]
Putting them together yields:
li[.//div[@class='headline' and contains(text(), "Mongo like candy")]]//div[@class='abc']//input[@type='checkbox']

Something like:
li[div[@class='xyz']//div[@class='headline' and contains(text(),"Mongo like candy")]]//input[@type='checkbox']
unless I messed up the parentheses. (That is, you select not just any li, but the proper li.)

Even this works:
//li[1]/div[1]/input[@type='checkbox']
It may fail if more div tags are introduced in the page.
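
The predicate approach can be tried outside Selenium with a small sketch. It uses lxml, wraps the list items in a ul, and adds name attributes to the inputs purely so the two checkboxes can be told apart; those names are assumptions and are not part of the original markup. The headline is passed as an XPath variable, which lxml supports through keyword arguments.

```python
from lxml import html

# Markup from the question; the <ul> wrapper and the name attributes
# are added here for illustration only.
PAGE = """
<html><body><ul>
<li class="abc">
  <div class="abc"><input type="checkbox" name="first"></div>
  <div class="xyz"><div class="headline">Mongo like candy</div></div>
</li>
<li class="abc">
  <div class="abc"><input type="checkbox" name="second"></div>
  <div class="xyz"><div class="headline">Candygram for mongo</div></div>
</li>
</ul></body></html>
"""

tree = html.fromstring(PAGE)


def checkbox_for(headline):
    # The predicate on li keeps only the list item containing the headline;
    # the rest of the path then descends to that item's checkbox.
    return tree.xpath(
        "//li[.//div[@class='headline' and contains(text(), $headline)]]"
        "//div[@class='abc']//input[@type='checkbox']",
        headline=headline,
    )


first = [box.get("name") for box in checkbox_for("Mongo like candy")]
second = [box.get("name") for box in checkbox_for("Candygram for mongo")]
print(first)
print(second)
```

Unlike the positional //li[1]/div[1] form, this keeps working if the list items are reordered.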

xPath strange behaviour - selecting ALL elements even if [1] set

Today I stumbled upon a very interesting case (at least for me). I am experimenting with Selenium and XPath and tried to get some elements, but got strange behaviour:
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some other text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some even unrelated text
</a>
</div>
</div>
</div>
This is my data.
When I run the following XPath query:
//div[@class="title"][1]/a
I get ALL of the <a> elements as a result instead of only the first one. But if I query:
//div[@class="resultcontainer"][1]/div[@class="info"]/div[@class="title"]/a
I get only the first <a>, not all.
Is there some divine reason behind that?
Best regards,
bisko

I think you want
(//div[@class="title"])[1]/a
This:
//div[@class="title"][1]/a
selects all (<a> elements that are children of) <div> elements that have a @class of 'title' and that are the first such children of their parents (in this context). Which means: it selects all of them.
The working XPath selects all <div> elements that have a @class of 'title', and of those it takes the first one.
The predicates (the expressions in square brackets []) are applied to each element that matched the preceding location step (i.e. "//div") individually. To apply a predicate to a filtered set of nodes, you need to make the grouping clear with parentheses.
Consequently, this:
//div[1][@class="title"]/a
would select every <div> that is the first <div> child of its parent and then filter those down further by checking the @class value. Also not what you want. ;-)
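
The two forms can be compared side by side with a short sketch. It uses lxml on a compacted copy of the markup from the question; Selenium evaluates the same XPath 1.0 expressions, so it should behave the same way.

```python
from lxml import html

# Compacted copy of the three result containers from the question.
PAGE = """
<html><body>
<div class="resultcontainer"><div class="info"><div class="title"><a>some text</a></div></div></div>
<div class="resultcontainer"><div class="info"><div class="title"><a>some other text</a></div></div></div>
<div class="resultcontainer"><div class="info"><div class="title"><a>some even unrelated text</a></div></div></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# [1] is tested per parent: each title div is the first match under its own
# "info" parent, so all three links are returned.
per_parent = tree.xpath('//div[@class="title"][1]/a/text()')

# Parentheses collect every title div first, then [1] picks one from that set.
grouped = tree.xpath('(//div[@class="title"])[1]/a/text()')

print(per_parent)
print(grouped)
```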
