XPATH in difficult span - Scrapy - xpath

I use Scrapy and am writing a script based on XPath selectors. I am looking for the XPath syntax to collect two values, the price and the EAN number, which here are 500.02 and 08043687312822.
<div class="emProductPrice">
<span itemprop="offers" itemscope="" itemtype="http://example.com/Offer">
<span itemprop="price" content="500.02">500,02</span> hrywna<meta itemprop="priceCurrency" content="PLN">
<meta itemprop="gtin14" content="08043687312822">
<link itemprop="itemCondition" href="http://example.com/NewCondition">
<link itemprop="availability" href="http://example.com/InStock">
</span>
</div>
I tried writing something like //div[@class="emProductPrice"]/span/span/text(), but I only get &nbsp;. I need 500,02, for example.
How can I do this? Please help.

You need:
price = response.xpath('//span[@itemprop="price"]/@content').extract_first()
ean = response.xpath('//meta[@itemprop="gtin14"]/@content').extract_first()
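The same idea can be tried outside a spider. The sketch below is only an illustration: it uses Python's standard-library ElementTree instead of Scrapy, and it assumes a well-formed copy of the snippet in which the void tags are self-closed. ElementTree supports only a subset of XPath and cannot select an attribute with /@content, so the attribute is read with .get() after locating the element. The names SNIPPET and extract_price_and_ean are hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Well-formed copy of the snippet from the question (assumption: the void
# <meta> tags are self-closed so that the strict XML parser accepts them).
SNIPPET = """
<div class="emProductPrice">
  <span itemprop="offers" itemscope="" itemtype="http://example.com/Offer">
    <span itemprop="price" content="500.02">500,02</span> hrywna
    <meta itemprop="priceCurrency" content="PLN"/>
    <meta itemprop="gtin14" content="08043687312822"/>
  </span>
</div>
"""


def extract_price_and_ean(markup):
    """Return the content attributes of the price span and the gtin14 meta."""
    root = ET.fromstring(markup)
    price_node = root.find(".//span[@itemprop='price']")
    ean_node = root.find(".//meta[@itemprop='gtin14']")
    # Read the attribute rather than the text node: the attribute holds the
    # machine-readable value, the text holds the localized one (500,02).
    price = price_node.get("content") if price_node is not None else None
    ean = ean_node.get("content") if ean_node is not None else None
    return price, ean


price, ean = extract_price_and_ean(SNIPPET)
print(Decimal(price), ean)
```

Keeping the EAN as a string preserves its leading zero, while the price attribute already uses a dot as the decimal separator and converts cleanly to Decimal.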

Related

how to get specific text after a div with xpath

I am having trouble getting specific text that is located between two tags.
I mean, I want to get the text after the em tag ("Text after em tag. I want to get this.") and also the text after the p tag ("text after this p tag. I also want to get this.").
Is there any way of doing that?
Thanks in advance.
<article>
<h1 id='h1'>Heading 1</h1>
<img src='mypath/pictures/pic.jpg'></img>
<p></p>
<div id='div1'>
<time datetime='2016'>2016</time>
</div>
<br></br>
<em>my location, TN, United States</em>
Text after em tag. I want to get this.
<p></p>
text after this p tag. I also want to get this.
<div id='div2'>
</div>
</article>
You can get the following sibling text nodes by using
following-sibling::text()
So, to get the text directly after each em tag:
//em/following-sibling::text()[1]
The same applies to the p tag. Then join them:
string-join(em/following-sibling::text()[1] | p/following-sibling::text()[1] , ',')
I hope this helps!
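Note that string-join() is an XPath 2.0 function, so an XPath 1.0 engine will not accept it and the join has to happen in the host language. As a rough sketch of that approach, the example below uses Python's standard-library ElementTree, which has no following-sibling axis but exposes the text that directly follows an element as its .tail attribute. The markup is a shortened, well-formed copy of the question's article, and the helper name text_after is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the <article> from the question.
ARTICLE = """
<article>
  <h1 id="h1">Heading 1</h1>
  <p></p>
  <em>my location, TN, United States</em>
  Text after em tag. I want to get this.
  <p></p>
  text after this p tag. I also want to get this.
  <div id="div2"></div>
</article>
"""


def text_after(root, tag):
    """Return the non-empty text that directly follows each <tag> element."""
    tails = ((el.tail or "").strip() for el in root.iter(tag))
    return [tail for tail in tails if tail]


root = ET.fromstring(ARTICLE)
# Collect the text after <em> and after <p>, then join in Python.
joined = ",".join(text_after(root, "em") + text_after(root, "p"))
print(joined)
```

The first, empty p element is followed only by whitespace, so it is filtered out and contributes nothing to the joined string.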

Google Structured Data Testing Tool doesn't validate GoodRelations extension

<div
itemscope="itemscope"
itemtype="http://schema.org/Product"
itemid="urn:mpn:123456789">
<link
itemprop="additionalType"
href="http://www.productontology.org/id/Lawn_mower">
<span
itemprop="http://purl.org/goodrelations/v1#category"
content="Lawn mower">
Lawn mower
</span>
</div>
Above is a fragment of my markup. When I run it through the Google Structured Data Testing Tool, I receive the error:
'Error: Page contains property "http://purl.org/goodrelations/v1#category" which is not part of the schema.'.
I was thinking about removing the microdata from the span tag and keeping only the link tag above with microdata, so that it validates.
On [http://www.productontology.org/doc/Lawn_mower] there is the statement: "Breaking news: schema.org has just implemented our proposal to define an additionalType property with the use of this service in mind!" and I think it means it is compatible.
Can this error impact my SEO? Is there any advice for me? I have searched a lot and couldn't find anything related.
The final markup after @daviddeering's help:
<div itemscope="itemscope" itemtype="http://schema.org/Product" itemid="urn:mpn:123456789">
<a href="http://127.0.0.1/jkr/123456789" itemprop="url">
<img itemprop="image" alt="Partnumber:123456789" src="http://127.0.0.1/jkr/img/123456789.jpg" content="http://127.0.0.1/jkr/img/123456789.jpg">
<span itemprop="name">123456789 - Bosh lawn mower</span>
</a>
<span>PartNumber: </span>
<span itemprop="mpn">123456789</span>
<span>Line: </span>
<span itemprop="additionalType" href="http://www.productontology.org/id/Lawn_Mower">Lawn mower</span>
<span>Manuf.: </span>
<div itemscope="itemscope" itemprop="manufacturer"
itemtype="http://schema.org/Organization"><span itemprop="name">Bosh</span>
</div>
<div itemprop="offers" itemscope="itemscope" itemtype="http://schema.org/Offer">
<meta itemprop="availabilityStarts" content="2013-10-20 05:27:36"><span itemprop="priceCurrency" content="USD">USS</span><span itemprop="price" content="565.29">565,29*</span>
<link itemprop="availability" href="http://schema.org/OutOfStock"><span itemprop="inventoryLevel" content="0">Ask for it</span>
</div>
</div>
Well, the Product schema must always include a name. Also, the structure of your last itemprop line was incorrect. The following code tested fine in Google's testing tool:
<div
itemscope="itemscope"
itemtype="http://schema.org/Product"
itemid="urn:mpn:123456789">
<span itemprop="name">Name of Lawn Mower</span>
<link
itemprop="additionalType"
href="http://www.productontology.org/id/Lawn_mower">
<span rel="gr:hasBusinessFunction" resource="http://purl.org/goodrelations/v1#sell"
content="Lawn mower">
Lawn mower
</span>
</div>
Although in your case, I'm not sure if it's necessary to combine the Product schema and the GoodRelations markup. You could create the entire markup using just GoodRelations, or you could use schema.org and simply keep the tag <link itemprop="additionalType" href="http://www.productontology.org/id/Lawn_mower"> where it currently is in the code, then continue using schema.org to mark up the rest.

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via ImportXML in Google Spreadsheet using XPath.
The attribute value I am after is content, found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but I can't get to the <span> element.
I tried using:
//span[@class='pret']/@content I get #N/A;
//span[@itemprop='price']/@content I get #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content I get #N/A;
//div[@class='left']/span[1]/@content I get #N/A;
//div[@class='left']/span/text() to get the text node of <span> I get #N/A;
//div[@class='left']//span/text() I get the text node of a <span> lower in div.left.
To get the text node of <span> I have to use //div[@class='left']/text(). But I can't use that text node because the layout of the span changes if a product is on sale, so I need the attribute.
It's as if the span I'm looking for does not exist, although it appears in Chrome's developer tools and in the page source, and all the XPath expressions work in the console using $x("").
I tried to generate the XPath directly from the developer tools by right-clicking, and I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which does not work. I also tried to generate the XPath with Firefox and with plugins for Firefox and Chrome, to no avail. The XPath generated in these ways did not even work on sites I managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site, with an apparently similar code structure, the XPath //span[@itemprop='price']/@content works.
I have struggled with this for 4 days now. I'm starting to think it has something to do with the auto-closing meta tag, but why doesn't this happen on the other site?
Perhaps the following formulas can help you:
=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)
UPDATE
It seems that ImportXML fails because it cannot properly parse the entire document. An extract of the document, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    var html = response.getContentText();
    // Capture the content attribute of the price span; [^"]* stops at the closing quote.
    var match = html ? html.match(/<span class="pret" itemprop="price" content="([^"]*)">/i) : null;
    // Return an empty string instead of throwing when the span is not found.
    if (match) content = match[1];
  }
  return content;
}
Then you can use it as follows:
=MyImportXML("http://...")
At this time, the web page referred to in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639:
//b[@itemprop='price']
It looks to me like the problem was that the meta tag was not XHTML compliant, but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, another solution should be used instead of IMPORTXML, such as IMPORTDATA with REGEXEXTRACT, or Google Apps Script with the UrlFetch Service and the JavaScript match function, among other alternatives.
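Outside of Google Sheets, one more option for markup that a strict XML parser rejects is a tolerant HTML parser. The sketch below is only an illustration using Python's standard-library html.parser on an inline copy of the "before" markup with the unclosed meta tag. The names ItempropCollector, collect_itemprop and PAGE are hypothetical, and fetching the real page is left out.

```python
from html.parser import HTMLParser

# Inline copy of the markup from the question; note the unclosed <meta> tag.
PAGE = """
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect the content attribute of every tag with a given itemprop."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        attr_map = dict(attrs)
        if attr_map.get("itemprop") == self.itemprop and "content" in attr_map:
            self.values.append(attr_map["content"])


def collect_itemprop(markup, itemprop):
    parser = ItempropCollector(itemprop)
    parser.feed(markup)
    parser.close()
    return parser.values


print(collect_itemprop(PAGE, "price"))
```

Because this parser works tag by tag and does not build a tree, the unclosed meta tag does not affect whether the span is found.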
Try something like this:
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print('content =', node.get('content'))
But I haven't tried that.

schema.org / microdata - Product or Offer?

I am having trouble using microdata and working out which itemtype to use, either Product or Offer. I have used Offer to add data to the various products that we sell (1 per page). Although this validates properly in the Google Structured Data testing tool, it will not show the Price/Rating/InStock in the results. If I use a mixture of Product and Offer then it will, although I am not sure this is the correct way to do it?
Thanks,
Rick
<title>My Tent</title>
<div itemscope itemtype="http://schema.org/Offer">
<div itemprop="name" class="product-details-title" id="item_product_prop">My Tent</div>
<div itemprop="description" id="item_product_prop">A Description for MyTent</div>
<meta itemprop="aggregateRating" id="item_product_prop" content="[3 Ratings]">
<div id="item_product_prop" itemprop="price">$13</div>
<div itemprop="availability" id="item_product_prop" content="InStock"></div></div>
</div>
http://www.google.com/webmasters/tools/richsnippets?q=uploaded:8004e1c78c1098daa7aa283c26b42939
If a product is offered for sale on your website, you actually need both Offer and Product on a product page. The Offer should be nested within the Product.
Google also requires price and priceCurrency (both properties of Offer).
A good example is provided on the schema.org/Product page. Use it as a reference and you will get what you want.
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">Kenmore White 17" Microwave</span>
<img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
<div itemprop="aggregateRating"
itemscope itemtype="http://schema.org/AggregateRating">
Rated <span itemprop="ratingValue">3.5</span>/5
based on <span itemprop="reviewCount">11</span> customer reviews
</div>
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<span itemprop="price">$55.00</span>
<link itemprop="availability" href="http://schema.org/InStock" />In stock
</div>
</div>
I think you need an extra level in your offer.
If you look at http://www.heppnetz.de/ontologies/goodrelations/goodrelations-UML.png, you see that an offer includes one or more products.

review count and rating using an image - schema.org

I need some help getting rich snippets to work on my site.
I inserted the review microdata following the instructions given on schema.org here http://schema.org/docs/gs.html#advanced_missing, using the star image for the rating and the text for the review count, but when I tested it with the test tool it showed nothing.
Example page where we use the microdata for the reviews.
Here is what I used:
<div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
<img src="/images/stars/4.5.gif" border=0>
<meta itemprop="ratingValue" content="4.5" />
<meta itemprop="bestRating" content="5" />
<BR>
<span class="bottomnavfooter">
<A HREF="javascript:an();">Read (<span itemprop="ratingCount">70</span>) Reviews</A
</span>
</div>
I then created a static test page and made some changes using the instructions Google provides here http://www.google.com/support/webmasters/bin/answer.py?answer=172705 (which are different from what I found on schema.org!!), but the test still returned only the product name, not the price or the reviews.
Here is my test page. Can you please see where I'm going wrong?
Thanks much!!
The above code snippet will fail because it has an itemprop for aggregateRating, but isn't enclosed in an itemscope. It also doesn't help that your final anchor close tag is missing a >, but I guess that was just an accident when you were copying the code into SO.
The other problem arises mainly because the example on the schema.org site is wrong (I have filed a bug report on this). It uses itemprop="reviews" instead of itemprop="aggregateRating". The code should look more like the following:
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="name">Ray-Ban 2132 New Wayfarer Sunglasses</span>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<img src="/images/stars/4.5.gif" border=0>
<meta itemprop="ratingValue" content="4.5" />
<meta itemprop="bestRating" content="5" />
<br />
<span class="bottomnavfooter">
Read (<span itemprop="ratingCount">70</span>) Reviews
</span>
</div>
</div>
