Scrapy and XPath issue with nested XPaths

I'm trying to scrape Amazon products with Scrapy.
Starting from a random category, I use this XPath:
products = Selector(response).xpath('//div[@class="s-item-container"]')
for product in products:
    item = AmzItem()
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0]
    yield item
'//div[@class="s-item-container"]' returns all the divs with the products on one category page, which is correct.
Now, how do I get the link to each product? My understanding is:
// matches anywhere in the document
a with @class should select the link with the right class
But I get:
item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
exceptions.IndexError: list index out of range
So the list matching this XPath must be empty, but I don't understand why.
EDIT:
The HTML looks like this:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>

It should be:
# The leading dot makes the query relative to product
product.xpath('.//a[@class="s-access-detail-page"]/@title')

//a[@class="s-access-detail-page"] only matches when the attribute is exactly class="s-access-detail-page", because XPath compares the attribute as a plain string and knows nothing about CSS class semantics. When an element has several classes, use the contains function on the space-padded attribute:
.//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title
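The two points in this answer (query relative to each product, and match one class out of several) can be sketched with only the Python standard library. This is a stand-in for Scrapy selectors, not Scrapy itself: ElementTree has no contains() function, so the class check is done in Python. The markup, the helper has_class, and the titles are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with two product containers.
HTML = """
<root>
  <div class="s-item-container">
    <a class="a-link-normal s-access-detail-page a-text-normal" title="Product A" href="/a"/>
  </div>
  <div class="s-item-container">
    <a class="a-link-normal s-access-detail-page a-text-normal" title="Product B" href="/b"/>
  </div>
</root>
"""


def has_class(element, name):
    """Return True if `name` is one of the space-separated class tokens."""
    return name in element.get("class", "").split()


root = ET.fromstring(HTML)
products = root.findall(".//div[@class='s-item-container']")

# An exact string comparison fails because the attribute holds three classes.
exact = [a for p in products for a in p.findall(".//a[@class='s-access-detail-page']")]

# Searching below each product (the leading dot) and testing class tokens works.
items = []
for product in products:
    links = [a for a in product.findall(".//a") if has_class(a, "s-access-detail-page")]
    if links:
        items.append({"title": links[0].get("title"), "url": links[0].get("href")})

print(len(exact))
print(items)
```

Checking that links is non-empty before indexing also avoids the IndexError from the question when a container has no matching link.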

Related

HtmlAgilityPack SelectNodes InnerText returns placeholder text not actual value

In my app I want to scrape a web page to extract the values I am interested in.
(ShopData is the HtmlNodeCollection.)
My C# code looks like this:
var ShopName = ShopData.SelectNodes(".//div[@class='shop-name']");
This returns null.
If I try this instead, it returns the node:
var ShopName1 = ShopData.SelectNodes(".//div[contains(@class, 'shop cf')]");
Why does .//div[@class='shop-name'] not work?
If I do ShopData.SelectNodes(".//div[contains(@class, 'shop cf')]").ToList()[0];
then the InnerText is empty.
At the same time, ShopData.SelectNodes(".//div[@class='price']").ToList()[0].InnerText returns the text normally.
What is the difference between these two calls?
My web page looks like this:
<li class="cf card js-product-card">
<div class="shop cf">
<div class="shop-logo js-shop-logo">
<img class="fade-in" src="//a.scdn.gr/ds/shops/logos/2870/mid_20181210114648_f305ba08.jpeg" data-src="//a.scdn.gr/ds/shops/logos/2870/mid_20181210114648_f305ba08.jpeg" alt="Electroholic">
</div>
<i class="icon tooltip-parent js-tooltip-handler trustmark" data-trigger="toggle" data-type="string" data-theme="light" data-content="Το κατάστημα διαθέτει πιστοποίηση GRECA Trustmark που σημαίνει ότι έχει δεσμευτεί να εργαστεί σύμφωνα με τον Eλληνικό και Ευρωπαϊκό (αντίστοιχα) Κώδικα Ηλεκτρονικού Εμπορίου, διασφαλίζοντας δεοντολογικά πρότυπα στην ψηφιακή αγορά.<div>Περισσότερες πληροφορίες στην <a href='http://www.greekecommerce.gr/' target='_blank'>ιστοσελίδα του GRECA.</a></div>" data-placement="left">
<span>GRECA Trustmark</span>
</i>
<div class="shop-name">Electroholic</div>
</div>
<div class="description">
<div class="item">
<h3>
<a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link content-placeholder" data-type="title" href="/products/show/32755241">
Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις</a>
</h3>
<p class="availability"><span class="availability">Παράδοση έως 30 ημέρες</span></p>
</div>
</div>
<div class="price">
<div class="">
<div class="price-content"><a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link product-link content-placeholder" data-type="net_price" href="/products/show/32755241">358,00 €</a><span class="extra-cost cf"><em>+ 9,00 €</em> <span>Μεταφορικά</span></span><span class="extra-cost cf"><em>+ 2,00 €</em> <span>Αντικαταβολή</span></span><span class="final-price"><a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link content-placeholder" data-type="final_price" href="/products/show/32755241">369,00 €</a></span></div>
</div>
</div>
<div class="shop-details react-expander-bottom js-product-uservoice"><span class="payment-options"><i class="icon tooltip-parent js-tooltip-handler trustmark" data-trigger="toggle" data-type="string" data-theme="light" data-content="Το κατάστημα διαθέτει πιστοποίηση GRECA Trustmark που σημαίνει ότι έχει δεσμευτεί να εργαστεί σύμφωνα με τον Eλληνικό και Ευρωπαϊκό (αντίστοιχα) Κώδικα Ηλεκτρονικού Εμπορίου, διασφαλίζοντας δεοντολογικά πρότυπα στην ψηφιακή αγορά.<div>Περισσότερες πληροφορίες στην <a href='http://www.greekecommerce.gr/' target='_blank'>ιστοσελίδα του GRECA.</a></div>" data-placement="auto vertical"><span>GRECA Trustmark</span></i>
</span>
<div class="shop-expander-tabs">
<button class="shop-tab js-shop-tab icon ">
<div class="rating-with-count react-component">
<a class="rating stars" title="3,9 αστέρια από 1493 χρήστες" href="#reviews">
<div class="rating-wrapper">
<div class="actual-rating blue" itemprop="" style="width: 78%;">1493</div><span itemprop="">3,9</span></div>
</a>
<div class="reviews-count blue">
<a title="1493 αξιολογήσεις χρηστών" href="#reviews">1493</a></div>
</div>
</button>
<button class="shop-tab js-shop-tab icon location-tab multi-shops ">
<span>Περιστέρι, Αττική</span></button>
</div>
<div class="shop-info-object js-shop-info-expander ">
</div>
</div>
</li>
OK, I figured out what is going on.
The web page fills in these values with AJAX calls, which is why they are missing from the HTML that I download and parse.
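A minimal sketch of that situation, written in Python with the standard library rather than C# and HtmlAgilityPack. The raw markup below is an assumption: it is a guess at what the server might send before any script runs, chosen only to be consistent with the behaviour described in the question (no shop-name div, an empty "shop cf" div, a filled price div).

```python
import xml.etree.ElementTree as ET

# Hypothetical raw response, before JavaScript/AJAX fills in the shop details.
RAW = """
<li class="cf card js-product-card">
  <div class="shop cf"></div>
  <div class="price">358,00</div>
</li>
"""

card = ET.fromstring(RAW)

shop_name = card.find(".//div[@class='shop-name']")  # not in the raw HTML
shop = card.find(".//div[@class='shop cf']")         # present but empty
price = card.find(".//div[@class='price']")          # present with text

shop_text = (shop.text or "").strip()

print(shop_name is None)
print(repr(shop_text))
print(price.text)
```

Comparing the downloaded HTML with what the browser's developer tools show is a quick way to confirm whether a value is injected by script after page load.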

KnockoutJS elements not rendered once loaded via Jquery Ajax function

I have loaded a sidebar over AJAX; however, this HTML relies on KnockoutJS to render completely. How do I execute the KnockoutJS portions of this code?
The content below is loaded via the jQuery ajax function and contains a number of Knockout elements as well as some x-magento-init scripts:
<div class=\"block filter\" id=\"layered-filter-block\" data-mage-init='{\"collapsible\":{\"openedState\": \"active\", \"collapsible\": true, \"active\": false, \"collateral\": { \"openedState\": \"filter-active\", \"element\": \"body\" } }}'>
<div class=\"block-title filter-title\" data-count=\"0\">
<strong data-role=\"title\">Shop By<\/strong>
<\/div>
<div class=\"block-content filter-content\">
<strong role=\"heading\" aria-level=\"2\" class=\"block-subtitle filter-subtitle\">Shopping Options<\/strong>
<div class=\"filter-options\" id=\"narrow-by-list\" data-role=\"content\" data-mage-init='{\"accordion\":{\"openedState\": \"active\", \"collapsible\": true, \"active\": [0,1,2], \"multipleCollapsible\": true}}'>
<div data-role=\"collapsible\" class=\"filter-options-item\">
<div data-role=\"title\" class=\"filter-options-title\">Category<\/div>
<div data-role=\"content\" class=\"filter-options-content\">\n<ol class=\"items\">
<li class=\"item\">
<a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=143&q=ice+machine\">Front of House
<span class=\"count\">2<span class=\"filter-count-label\">items<\/span><\/span><\/a>
<\/li>
<li class=\"item\">
<a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=182&q=ice+machine\">Bar Supplies
<span class=\"count\">4<span class=\"filter-count-label\">items<\/span><\/span><\/a>
<\/li>
<li class=\"item\">
<a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=257&q=ice+machine\">Catering Equipment<span class=\"count\">111<span class=\"filter-count-label\">\n
items <\/span><\/span>\n
<\/a>\n <\/li>\n
<li class=\"item\">\n
<a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=342&q=ice+machine\">\n
Warewashing <span class=\"count\">\n
3 <span class=\"filter-count-label\">\n
items <\/span><\/span>\n
<\/a>\n <\/li>\n <li class=\"item\">\n
<a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=521&q=ice+machine\">\n
Catering Equipment Offers <span class=\"count\">\n 1
<span class=\"filter-count-label\">\n item <\/span><\/span>\n
<\/a>\n <\/li>\
<\/ol>
<\/div>\n
<\/div>\n
<div data-role=\"collapsible\" class=\"filter-options-item\">
<div data-role=\"title\" class=\"filter-options-title\">Brand<\/div>\n
<div data-role=\"content\" class=\"filter-options-content\">
<div data-bind=\"scope: 'brandFilter'\">
<!-- ko template: getTemplate() --> <!-- \/ko -->
<\/div>
<script type=\"text\/x-magento-init\">
{\"*\" : {\"Magento_Ui\/js\/core\/app\": {\"components\": {\"brandFilter\": {\"component\":\"Smile_ElasticsuiteCatalog\\\/js\\\/attribute-filter\",\"maxSize\":10,\"displayProductCount\":true,\"hasMoreItems\":true,\"ajaxLoadUrl\":\"http:\\\/\\\/www.domain.com\\\/catalog\\\/navigation_filter\\\/ajax\\\/?ajax=1&filterName=brand&q=ice+machine\",\"items\":[{\"label\":\"Scotsman\",\"count\":41,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Scotsman&q=ice+machine\",\"is_selected\":false},{\"label\":\"Hoshizaki\",\"count\":15,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Hoshizaki&q=ice+machine\",\"is_selected\":false},{\"label\":\"Ice-o-matic\",\"count\":12,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Ice-o-matic&q=ice+machine\",\"is_selected\":false},{\"label\":\"Blue Ice\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Blue+Ice&q=ice+machine\",\"is_selected\":false},{\"label\":\"Graupel\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Graupel&q=ice+machine\",\"is_selected\":false},{\"label\":\"Nemox\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Nemox&q=ice+machine\",\"is_selected\":false},{\"label\":\"Manitowoc\",\"count\":6,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Manitowoc&q=ice+machine\",\"is_selected\":false},{\"label\":\"Polar Refrigeration\",\"count\":5,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Polar+Refrigeration&q=ice+machine\",\"is_selected\":false},{\"label\":\"Longo & 
Co\",\"count\":4,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Longo+%26+Co&q=ice+machine\",\"is_selected\":false},{\"label\":\"Beaumont\",\"count\":3,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Beaumont&q=ice+machine\",\"is_selected\":false}]}}}}}\n<\/script>\n\n<\/div>\n <\/div>\n <div data-role=\"collapsible\" class=\"filter-options-item\">\n <div data-role=\"title\" class=\"filter-options-title\">Power<\/div>\n <div data-role=\"content\" class=\"filter-options-content\"><div data-bind=\"scope: 'power_ddFilter'\">\n <!-- ko template: getTemplate() --> <!-- \/ko -->\n<\/div>\n\n<script type=\"text\/x-magento-init\">\n {\"*\" : {\"Magento_Ui\/js\/core\/app\": {\"components\": {\"power_ddFilter\": {\"component\":\"Smile_ElasticsuiteCatalog\\\/js\\\/attribute-filter\",\"maxSize\":10,\"displayProductCount\":true,\"hasMoreItems\":false,\"ajaxLoadUrl\":\"http:\\\/\\\/www.domain.com\\\/catalog\\\/navigation_filter\\\/ajax\\\/?ajax=1&filterName=power_dd&q=ice+machine\",\"items\":[{\"label\":\"13 Amp (Plug)\",\"count\":111,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&power_dd=13+Amp+%28Plug%29&q=ice+machine\",\"is_selected\":false},{\"label\":\"1 Phase (Hard Wired)\",\"count\":2,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&power_dd=1+Phase+%28Hard+Wired%29&q=ice+machine\",\"is_selected\":false}]}}}}}\n<\/script>\n\n<\/div>\n <\/div>\n <div data-role=\"collapsible\" class=\"filter-options-item\">\n <div data-role=\"title\" class=\"filter-options-title\">Price<\/div>\n <div data-role=\"content\" class=\"filter-options-content\"><div class=\"smile-es-range-slider\" data-role=\"range-price-slider-price\">\n <div data-role=\"from-label\"><\/div>\n <div data-role=\"to-label\"><\/div>\n <div data-role=\"slider-bar\"><\/div>\n <div class=\"actions-toolbar\">\n <div data-role=\"message-box\"><\/div>\n <div 
class=\"actions-primary\">\n <a class=\"action primary small\" data-role=\"apply-range\">\n <span>OK<\/span>\n <\/a>\n <\/div>\n <\/div>\n<\/div>\n\n<script type=\"text\/x-magento-init\">\n { \"[data-role=range-price-slider-price]\" : { \"rangeSlider\" : {\"minValue\":1,\"maxValue\":6091,\"currentValue\":{\"from\":1,\"to\":6091},\"fieldFormat\":{\"pattern\":\"\\u00a3%s\",\"precision\":2,\"requiredPrecision\":2,\"decimalSymbol\":\".\",\"groupSymbol\":\",\",\"groupLength\":3,\"integerRequired\":false},\"intervals\":[{\"value\":1,\"count\":1},{\"value\":2,\"count\":1},{\"value\":3,\"count\":1},{\"value\":40,\"count\":1},{\"value\":60,\"count\":1},{\"value\":64,\"count\":1},{\"value\":150,\"count\":1},{\"value\":179,\"count\":1},{\"value\":190,\"count\":1},{\"value\":242,\"count\":1},{\"value\":291,\"count\":1},{\"value\":325,\"count\":1},{\"value\":355,\"count\":2},{\"value\":395,\"count\":1},{\"value\":465,\"count\":1},{\"value\":472,\"count\":1},{\"value\":515,\"count\":1},{\"value\":520,\"count\":1},{\"value\":535,\"count\":1},{\"value\":555,\"count\":1},{\"value\":577,\"count\":1},{\"value\":585,\"count\":1},{\"value\":599,\"count\":1},{\"value\":605,\"count\":2},{\"value\":615,\"count\":1},{\"value\":640,\"count\":1},{\"value\":658,\"count\":1},{\"value\":685,\"count\":1},{\"value\":705,\"count\":1},{\"value\":730,\"count\":1},{\"value\":745,\"count\":2},{\"value\":785,\"count\":1},{\"value\":805,\"count\":1},{\"value\":830,\"count\":1},{\"value\":895,\"count\":2},{\"value\":925,\"count\":1},{\"value\":965,\"count\":1},{\"value\":970,\"count\":1},{\"value\":990,\"count\":2},{\"value\":1030,\"count\":1},{\"value\":1065,\"count\":1},{\"value\":1080,\"count\":1},{\"value\":1085,\"count\":1},{\"value\":1095,\"count\":1},{\"value\":1105,\"count\":1},{\"value\":1130,\"count\":1},{\"value\":1155,\"count\":1},{\"value\":1225,\"count\":1},{\"value\":1235,\"count\":1},{\"value\":1240,\"count\":1},{\"value\":1259,\"count\":1},{\"value\":1310,\"count\":1},{\"value\":1360,\"co
unt\":1},{\"value\":1365,\"count\":1},{\"value\":1450,\"count\":1},{\"value\":1485,\"count\":1},{\"value\":1495,\"count\":1},{\"value\":1510,\"count\":1},{\"value\":1580,\"count\":2},{\"value\":1605,\"count\":2},{\"value\":1685,\"count\":1},{\"value\":1710,\"count\":1},{\"value\":1779,\"count\":1},{\"value\":1785,\"count\":1},{\"value\":1865,\"count\":1},{\"value\":1870,\"count\":1},{\"value\":1885,\"count\":1},{\"value\":1890,\"count\":1},{\"value\":1970,\"count\":1},{\"value\":1995,\"count\":1},{\"value\":2000,\"count\":1},{\"value\":2050,\"count\":1},{\"value\":2130,\"count\":1},{\"value\":2199,\"count\":1},{\"value\":2220,\"count\":1},{\"value\":2345,\"count\":1},{\"value\":2350,\"count\":1},{\"value\":2360,\"count\":1},{\"value\":2405,\"count\":1},{\"value\":2415,\"count\":1},{\"value\":2445,\"count\":1},{\"value\":2450,\"count\":2},{\"value\":2480,\"count\":1},{\"value\":2500,\"count\":1},{\"value\":2530,\"count\":1},{\"value\":2565,\"count\":1},{\"value\":2570,\"count\":1},{\"value\":2595,\"count\":1},{\"value\":2695,\"count\":1},{\"value\":2730,\"count\":1},{\"value\":2825,\"count\":1},{\"value\":2850,\"count\":1},{\"value\":2950,\"count\":1},{\"value\":2995,\"count\":1},{\"value\":3010,\"count\":1},{\"value\":3025,\"count\":1},{\"value\":3145,\"count\":1},{\"value\":3205,\"count\":1},{\"value\":3295,\"count\":1},{\"value\":3300,\"count\":1},{\"value\":3485,\"count\":1},{\"value\":3495,\"count\":1},{\"value\":3580,\"count\":1},{\"value\":4015,\"count\":1},{\"value\":4075,\"count\":1},{\"value\":4305,\"count\":1},{\"value\":4310,\"count\":1},{\"value\":4595,\"count\":1},{\"value\":4620,\"count\":1},{\"value\":5250,\"count\":1},{\"value\":5355,\"count\":1},{\"value\":6090,\"count\":1}],\"urlTemplate\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&price=<%- from %>-<%- to %>&q=ice+machine\",\"messageTemplates\":{\"displayCount\":\"<%- count %> products\",\"displayEmpty\":\"No products in the selected range.\"},\"rate\":1}
} }
<\/script>
<\/div>
<\/div>
<\/div>
<\/div>
<\/div>
This is then added to a block on my page via the jQuery html method:
$(sidebarBlock).html(this.filters);
Looking at the DOM, I cannot see the scripts, although they are present in the response when I inspect it with console.log(). Similarly, the code below shows that the scripts are present:
$(sidebar).find("script").each(function() {
    console.log("found a script");
});
I have tried to use .trigger('contentUpdated'); like this:
document.getElementById("layered-filter-block").innerHTML = this.filters;
$(sidebarBlock).trigger('contentUpdated');
and:
$(sidebarBlock).html(this.filters);
$(sidebarBlock).trigger('contentUpdated');
and by reapplying the Knockout bindings:
ko.cleanNode($('#layered-filter-block'));
ko.applyBindings($('#layered-filter-block'));
The above throws an error about bindings already being applied, even though I call cleanNode first to unbind.
This fixed the issue for me:
$(sidebarBlock).applyBindings();
https://codeblog.experius.nl/magento-2-uicomponent-reinit-ajax-reload/

Remove leading or trailing whitespace of string except html tag in ruby

I want to remove the leading and trailing whitespace around the text in a string while leaving the HTML tags intact.
Example:
html = <a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>
My way of doing this:
html.gsub(/>\s{1,8}</, "><").gsub(/>\s{1,8}/, ">").gsub(/\s{1,8}</, "<")
How the blanks are removed depends on the pattern.
Is there a better way to write it?
Use positive lookarounds:
html = %| <a class=\"c-.......| # your line goes here
html.gsub(/(?<=>)\s+|\s+(?=<)/, '')
The above means “remove all whitespace after '>' or before '<'.”
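For comparison, the same lookaround pattern can be sketched in Python, whose re module accepts this lookbehind and lookahead unchanged. The function name strip_around_tags and the sample string are made up for illustration.

```python
import re


def strip_around_tags(html):
    """Remove whitespace that directly follows '>' or directly precedes '<'."""
    return re.sub(r"(?<=>)\s+|\s+(?=<)", "", html)


sample = '<div class="c-flex">\n  <div class="title">\n    Yahoo\n  </div>\n</div>'
cleaned = strip_around_tags(sample)
print(cleaned)
```

Whitespace inside the text and inside tag attributes is left alone, but a space between text and an inline tag (for example before a b element in the middle of a sentence) is removed too, so the pattern suits block-level markup like the example better than running prose.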
Try the following:
html = "<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>"
-> html.squeeze(' ').strip
output:
"<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\"> <div class=\"c-flex\"> <div class=\"c-grid__quotation--main\"> <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" /> </div> <div class=\"c-grid__quotation--side\"> <div class=\"c-grid__quotation--side-title text--b\"> Yahoo </div> <div class=\"c-grid__quotation--side-description\"> News, email and search are just the beginning. Discover more every day. Find your yodel. </div> <div class=\"c-grid__quotation--side-url\"> www.yahoo.com </div> </div> </div> </div></a>"

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above I need to fetch the values of the following tags/classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
However, coursetitle appears in two places and I need it only once. Also, the instructor name is not wrapped in any specific tag that I could fetch.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list differs:
len(courseTitle) = 40 (doubled because of the repetition)
len(courseDetailLink) = 40 (doubled because of the repetition)
len(courseInstructorName) = 160 (unwanted text nodes are included because there is no specific tag for this value)
len(coursePriceType) = 20 (correct count, no repetition)
len(courseShortSummary) = 20 (correct count, no repetition)
len(courseUniversity) = 20 (correct count, no repetition)
Please help me modify my XPath queries to solve this problem. Thanks in advance.
You don't need re:test; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right place, you can continue querying relative to it.
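The idea of anchoring on one node and querying relative to it can be sketched per course box, which also keeps the title from being collected twice. This is a rough illustration with the standard library's ElementTree, not Scrapy. The markup is a trimmed, XML-clean version of the HTML in the question, the wrapper element courses and the helpers text_of and parse_course are made up, and reading the instructor from the text that trails the timeline div is an assumption about the page layout.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the question's markup, wrapped in a hypothetical root.
SNIPPET = """
<courses>
  <article class="course-box">
    <div class="row-fluid">
      <div class="span10">
        <h2 class="coursetitle"><a href="https://novoed.com/hc">Hippocrates Challenge</a></h2>
        <figure class="pricetag">Free</figure>
        <div class="timeline independent-text">
          <div class="timeline inline-block">Starting Spring 2014</div>
        </div>
        By Jill Helms
        <div class="university">Stanford University</div>
      </div>
    </div>
    <div class="hovered row-fluid">
      <div class="span10">
        <h2 class="coursetitle"><a href="https://novoed.com/hc">Hippocrates Challenge</a></h2>
        <p class="description">Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine.</p>
      </div>
    </div>
  </article>
</courses>
"""


def text_of(element):
    """Stripped text of an element, or None if the element was not found."""
    return (element.text or "").strip() if element is not None else None


def parse_course(article):
    # Anchor once on each block, then query relative to it.
    main = article.find("./div[@class='row-fluid']/div[@class='span10']")
    hovered = article.find("./div[@class='hovered row-fluid']/div[@class='span10']")
    link = main.find("./h2[@class='coursetitle']/a")
    timeline = main.find("./div[@class='timeline independent-text']")
    # The instructor has no tag of its own: it is the text after the timeline div.
    instructor = (timeline.tail or "").strip()
    if instructor.startswith("By "):
        instructor = instructor[3:]
    return {
        "title": text_of(link),
        "url": link.get("href"),
        "price": text_of(main.find("./figure[@class='pricetag']")),
        "timeline": text_of(timeline.find("./div[@class='timeline inline-block']")),
        "instructor": instructor,
        "university": text_of(main.find("./div[@class='university']")),
        "description": text_of(hovered.find("./p[@class='description']")),
    }


root = ET.fromstring(SNIPPET)
courses = [parse_course(a) for a in root.findall(".//article[@class='course-box']")]
print(courses)
```

Building one record per course box keeps all fields aligned, so the lists cannot drift apart in length the way the separate page-wide queries do.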

using variables in HtmlXPathSelectors

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. For example, if I have some long string, it would be nice to store it in a variable long_string and then select on it: hxs.select('//div[@id=long_string]').
I'm sure Scrapy supports this and I just can't figure it out, since it wouldn't make sense to always have to hard-code the string.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the segments below.
The following code selects only the first div:
hxs.select('//div[@id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[@id="foot"+"er"]')
Both of the code segments below select nothing and give no errors:
long_string = "foot"
hxs.select('//div[@id=concat(long_string,"er")]')
hxs.select('//div[@id=long_string]')
I would like to be able to use either of the last two methods and get the desired results.
Assuming + works for string concatenation in Scrapy, this should work:
hxs.select('//div[@id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
Have you tried this?
hxs.select('//div[@id="' + long_string_variable + '"]')
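The underlying point is that the XPath expression is just a Python string, so a name written inside the quotes is never replaced by the variable's value. The sketch below shows this with the standard library's ElementTree instead of Scrapy. The document is a cut-down version of the sample above, and the extra header div is made up.

```python
import xml.etree.ElementTree as ET

# Cut-down sample; the "header" div is hypothetical.
DOC = """
<body>
  <div id="header"/>
  <div id="footer">
    <div id="footer-menu"/>
  </div>
</body>
"""

root = ET.fromstring(DOC)

long_string = "foot"
target_id = long_string + "er"  # concatenate in Python, not inside the XPath

# Put the variable's value into the expression before running the query.
query = ".//div[@id='{}']".format(target_id)
matches = root.findall(query)

# Writing the variable name inside the string compares @id with the literal
# text "long_string", so nothing matches and no error is raised.
literal = root.findall(".//div[@id='long_string']")

print(query)
print([m.get("id") for m in matches])
print(len(literal))
```

Plain string formatting breaks if the value itself contains a quote character. Scrapy's selector documentation also describes XPath variables (a $name placeholder with the value passed as a keyword argument to xpath()), which avoids that quoting problem in versions that support it.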
