Xpath for Scrapy script

Xpath for Scrapy script - xpath

I'm trying to figure out which is the right Xpath without success due, this is the XML I am trying to extract from:
<div class="product-item__info">
Café Tostado y Molido House Blend Starbucks® Bolsa 250 g
<div class="product-item__brand">Starbucks</div>
<div class="product-item__description" style="display: none"></div>
<label class="product-item__compare product-compare">
<input type="checkbox" name="prod-compare-checkbox" class="product-compare__input">
I need the link and the brand. So far I tried, without success:
//div[#class="product-item__brand"]/text() #for Brand
//a[starts-with(#href, "https://www.wong.pe/")] #for link
What could it be? thanks

Related

Parsing through response created with XPath

Using Scrapy, I want to extract some data from a HTML well-formed site. With XPath I am able to extract a list of items, but I am not able to extra data from the elements in the list, using XPath
All XPath's have been tested using XPather. I have tested the issue using a local file that contains the webpage, same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[#id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following Xpath to extract "The text I want", i dont get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(#class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?

Try the following code.
XPATH_PRODUCT_NAME = ".//div[#class='item__name']/a[#class='item__name-link']/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()

Custom theme for Tumblr: search doesn't work

How can I show search results in my Tumblr theme?
The form:
<form action="/search" method="get">
<input type="text" name="q" value="{SearchQuery}"/>
<input type="submit" value="Search"/>
</form>
The markup:
{block:SearchPage}
????
{/block:SearchPage}
I tried several different variations of this code bot it simply doesn't return any results:
{block:Posts}
{block:SearchPage}
{SearchResultCount}
{Title}
{/block:SearchPage}
....
{/block:Posts}

Try this:
{block:SearchPage}
<div class="result">
<p>{lang:Found SearchResultCount results for SearchQuery 2}</p>
{block:NoSearchResults}<p>{lang:No results for SearchQuery 2}</p>{/block:NoSearchResults}
</div>
{/block:SearchPage}
It must absolutely be outside the Posts block. If you're still having trouble try skimming through a starter theme I've worked on found here. http://git.io/hA6dBA

XPath Getting child elements from html

I am trying to find the xpath for only the child of a navigation bar. The path which I am trying at the moment is //div[#class='navCol subMenus'] from this peace of HTML.
<div class="PrimaryNavigationContainer">
<div class="PrimaryNavigation">
<div class="Menu">
<div>
<span>Brands</span>
<div class="navCol">
<div>
<a class="NoLink unselectable"><span>Shop by Brand</span></a>
<div class="navCol subMenus">
<div>
<span>blah</span>
I have tried a number of Xpath syntax but none seem to work to bring up just the sub categories. Thank you for any help which you can provide.

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.
The attribute value i seek is content found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but i can't get to the <span> element.
Tried using:
//span[#class='pret']/#content i get #N/A;
//span[#itemprop='price']/#content i get #N/A;
//div[#class='left']/span[#class='pret' and #itemprop='price']/#content i get #N/A;
//div[#class='left']/span[1]/#content i get #N/A;
//div[#class='left']/span/text() to get the text node of <span> i get #N/A;
//div[#class='left']//span/text() i get the text node of a <span> lower in div.left.
To get the text node of <span> i have to use //div[#class='left']/text(). But i can't use that text node because the layout of the span changes if a product is on sale, so i need the attribute.
It's like the span i'm looking for does not exist, although it appears in the development view of Chrome and in the page source and all XPath work in the console using $x("").
I tried to generate the XPath directly form the development tool by right clicking and i get //*[#id='produs']/div[4]/div[4]/div[1]/span which does not work. I also tried to generate the XPath with Firefox and plugins for FF and Chrome to no avail. The XPath generated in these ways did not even work on sites i managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[#itemprop='price']/#content works.
I struggled with this for 4 days now. I'm starting to think it's something to do with the auto-closing meta tag, but why doesn't this happen on the other site?

Perhaps the following formulas can help you:
=ImportXML("http://...","//div[#class='product-info-price']//div[#class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[#class='product-info-price']//div[#class='left']"), 1, 2)
UPDATE
It seems that not properly parse the entire document, it fails. A document extraction, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[#class='product-info-price']//div[#class='left']//span[#itemprop='price']/#content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
var found, html, content = '';
var response = UrlFetchApp.fetch(url);
if (response) {
html = response.getContentText();
if (html) content = html.match(/<span class="pret" itemprop="price" content="(.*)">/gi)[0].match(/content="(.*)"/i)[1];
}
return content;
}
Then you can use as follows:
=MyImportXML("http://...")

At this time, the referred web page in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639
//b[#itemprop='price']
Looks to me that the problem was that the meta tag was not XHTML compliant but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, instead of IMPORTXML another solution should be used, like using IMPORTDATA and REGEXEXTRACT or Google Apps Script, the UrlFetch Service and the match JavasScript function, among other alternatives.

Try smth like this:
print 'content by key',tree.xpath('//*[#itemprop="price"]')[0].get('content')
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
print 'content =',node.get('content')
But i haven't tried that.

how to pull href link

I am trying to pull a link from a page that is in a formal I can't seem to find by simply googling... it might be simple but xpath is not my area of expertise
I am using c# and trying to pull the link and just write it to the console to figure out how to get the link
here is my C# code
var document = webGet.Load("http://classifieds.castanet.net/cat/vehicles/cars/0_-_4_years_old/");
var browser = document.DocumentNode.SelectSingleNode("//a[starts-with(#href,'/details/')]");
if (browser != null)
{
string htmlbody = browser.OuterHtml;
Console.WriteLine(htmlbody);
}
the html code section is
<div class="last">…</div>13»
<select name="sortby" class="sortby" onchange="doSort(this);">
<option value="">Most Recent</option>
<option value="of" >Oldest First</option>
<option value="mw" >Most Views</option>
<option value="lw" >Fewest Views</option>
<option value="lp" >Lowest Price</option>
<option value="hp" >Highest Price</option>
</select><div style="clear:both"></div>
</div>
<br /><br /><br />
<a href="/details/2008_vw_gti/1454282/" class="prod_container" >
<h2>2008 VW GTi</h2>
<div style="float:left; width:122px; z-index:1000">
<div class="thumb"><img src="http://c.castanet.net/img/28/thumbs/1454282-1-1.jpg" border="0"/></div>
<div class="clear"></div>
mls
</div>
<div class="descr">
The most fun car I have owned. Dolphin Grey, 4 door, Dual Climate control, DRG Transmission with paddle shift. Leather...
</div>
<div class="pdate">
<p class="price">$19,000.00</p>
<p class="date">Kelowna<br />Posted: Oct 15, 2:54 PM<br />Views: 349</p>
</div>
<div style="clear:both" ></div>
<div class="seal"><img src="/images/bookmark.png" /></div>
</a>
<a href="/details/price_drop_gorgeous_rare_white_2009_honda_accord_ex-l_coupe/1447341/" class="prod_container" >
<h2>PRICE DROP!!! Gorgeous Rare White 2009 Honda Accord EX-L Coupe </h2>
<div style="float:left; width:122px; z-index:1000">
<div class="thumb"><img src="http://c.castanet.net/img/28/thumbs/1447341-1-1.jpg" border="0"/></div>
<div class="clear"></div>
sun2010
</div>
<div class="descr">
the link I'm trying to get is the "/details/2008_vw_gti/1454282/" part. THanks

HTML isn't XML.
XPath is a tool for navigating through XML documents, however HTML does not conform to XML requirements. The HTML you've linked about isn't well formed XML, and as such XPath won't work.
You either need to look at using an HTML to XML convertor, and then adding the output of that conversion to your question to write XPath against, or investigate using a different tool for the job. I'd suggest doing a Google search for "C# HTML scrapers", but I'm not familiar with .Net so I can't offer a narrower option.

Try the following Xpath expression :
//a[#class="prod_container"]/#href

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Xpath for Scrapy script - xpath

Related

Parsing through response created with XPath

Custom theme for Tumblr: search doesn't work

XPath Getting child elements from html

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

how to pull href link

Categories

Resources