Parsing through response created with XPath - xpath

Using Scrapy, I want to extract some data from a HTML well-formed site. With XPath I am able to extract a list of items, but I am not able to extra data from the elements in the list, using XPath
All XPath's have been tested using XPather. I have tested the issue using a local file that contains the webpage, same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[#id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following Xpath to extract "The text I want", i dont get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(#class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?

Try the following code.
XPATH_PRODUCT_NAME = ".//div[#class='item__name']/a[#class='item__name-link']/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()

Related

XPath Query for URL Extraction

I need to extract http://site.ru/ from this code:
<div class="one">
<dl>
<dt class="two">
<span class="name">Site</span>
</dt>
<dd class="three">
<span class="js-pseudo-link" data-url="rAnDoMlEtTeRsAnDnUmBeRs" style>
<a href="http://site.ru/" class rel="nofollow" target="_blank" style> http://site.ru/ </a>
</span>
</dd>
</dl>
</div>
I use this XPath query: //div//dl//dd//span//a/#href
But it doesn't work. It doesn't return anything.
I'm a newbie in XPath.
Unfortunately, the data source you are looking for is an empty span node (class js-pseudo-link). The data-url attribute has the base64 encoded link you want. This node only gets populated after loading. ImportXML for some reason ignores nodes with no text and there's no way to get it not to do that. To get around this, looks like you'll have to write an apps script that can handle empty nodes or just gets the raw HTML code and parse it.

How to properly get the value contained inside a section using XPath?

having the following HTML (snippet grabbed from the web page I wanted to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
how do I get the 2nd li 1st span value using XPath? The result should be 20.90.
I have tried the following //div[#class="ulListContainer"]/section/ul[1]/li[2]/span[1] but I am not getting any values. I must said this is being used from a Google Sheet and using the function IMPORTXML (not sure what version of XPath it does uses) can I get some help?
Update
Apparently Google Sheets does not support such "complex" XPath expression since it seems to work fine:
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is :
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too :
=IMPORTXML(A1;"//div[#class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL

XPath Getting child elements from html

I am trying to find the xpath for only the child of a navigation bar. The path which I am trying at the moment is //div[#class='navCol subMenus'] from this peace of HTML.
<div class="PrimaryNavigationContainer">
<div class="PrimaryNavigation">
<div class="Menu">
<div>
<span>Brands</span>
<div class="navCol">
<div>
<a class="NoLink unselectable"><span>Shop by Brand</span></a>
<div class="navCol subMenus">
<div>
<span>blah</span>
I have tried a number of Xpath syntax but none seem to work to bring up just the sub categories. Thank you for any help which you can provide.

using variables in HtmlXPathSelectors

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. Meaning if I have some long string it would be nice to store it in a variable long_string and then select for it: hxs.select('\\div[#id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out as it wouldn't make sense for you to always have to hard-code the string in.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[#id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[#id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[#id=concat(long_string,"er")]')
hxs.select('//div[#id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.
Assuming + works for string concatenation in Scrapy, this should work:
hxs.select('//div[#id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
have you tried?
hxs.select('\\div[#id="' + long_string_variable + '"]')

Accessing a div element in an array of li elements

I am trying to access a div in an li array
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<div class="news-item">
</li>
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<div class="news-item">
<div class="image">
<div class="details with-image">
<h2>
<p class="standfirst">The best two-seat </p>
<div class="meta">
<div class="pub-date">26 April 2012</div>
<div class="topic-bar clearfix">
<div class="topic car_review">review</div>
</div>
</div>
</div>
</div>
</li>
I am trying to access the "div class="topic car_review">car review "and get its text.
The reason I am specifically using that text is that, depending on what the text is it would enter specific steps.
Code that I am using is
#topic = #browser.li(:class => /views-row-#{x}/).div(:class,'news-item').div(:class,'details').div(:class,'meta').div(:class,/topic /).text
The script was working fine before and suddenly it has stopped working and is just not able to get the div(:class,'news-item').
The error message I get is
unable to locate element, using {:class=>"news-item", :tag_name=>"div"} (Watir::Exception::UnknownObjectException)
I tried div(:class => /news-/) but still its just not able to find that element
I am really stuck!!!
I assume that when you are doing li(:class => /views-row-#{x}/), the x means you are iterating over all rows? If so, then your script will fail on the row-2 since it does not contain the news-item div (resulting in the error that you see).
If there is only one of these 'topic car_review' div tags, you can just do:
#topic = #browser.div(:class, 'topic car_review')
Update - Iterating over each LI:
If you need to iterate over each LI, then you could do:
#browser.lis.each do |li|
#topic = li.div(:class, 'topic car_review').text
end

Resources