Scrapy and XPath issue with nested Xpaths - xpath
I'm trying to read Amazon products into scrapy.
Starting from a random category using this XPath:
products = Selector(response).xpath('//div[@class="s-item-container"]')
for product in products:
    item = AmzItem()
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0]
    yield item
'//div[@class="s-item-container"]' returns all the divs with the products on one category page - that's correct.
Now, how would I get the link to the product?
// stands for "anywhere in the document"
a with the @class predicate should select the link with the right class
But I get:
item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
exceptions.IndexError: list index out of range
So the list matching this XPath must be empty - but I don't understand why.
EDIT:
The HTML would look like that:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>
It should be:
# The leading dot makes the query relative to product
product.xpath('.//a[@class="s-access-detail-page"]/@title')
//a[@class="s-access-detail-page"] only matches when the class attribute is exactly "s-access-detail-page", because XPath compares the attribute as a plain string and knows nothing about class tokens :) When an element has several classes, use the contains function:
.//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title
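To see both effects side by side, here is a minimal sketch that uses lxml directly (Scrapy's selectors are built on lxml, so the XPath semantics should be the same). The two-product page, the titles, and the names PAGE and HAS_CLASS are made up for illustration and are not Amazon's real markup.

```python
from lxml import html

# Made-up page with two products (illustrative only).
PAGE = """
<html><body>
  <div class="s-item-container">
    <a class="a-link-normal s-access-detail-page a-text-normal"
       title="First product" href="/dp/1">First</a>
  </div>
  <div class="s-item-container">
    <a class="a-link-normal s-access-detail-page a-text-normal"
       title="Second product" href="/dp/2">Second</a>
  </div>
</body></html>
"""

# Matches one class token inside a multi-valued class attribute.
HAS_CLASS = 'a[contains(concat(" ", @class, " "), " s-access-detail-page ")]'

doc = html.fromstring(PAGE)
products = doc.xpath('//div[@class="s-item-container"]')

# Exact string comparison fails: the attribute holds three classes.
exact = products[0].xpath('.//a[@class="s-access-detail-page"]/@title')

# "//" restarts at the document root, so every product sees both links.
absolute = [p.xpath('//' + HAS_CLASS + '/@title') for p in products]

# ".//" searches only below the current product node.
relative = [p.xpath('.//' + HAS_CLASS + '/@title') for p in products]

print(exact)
print(absolute)
print(relative)
```

The empty first result is what caused the IndexError in the question; the second result shows why dropping the dot would return the same (first) title for every product even once the class test is fixed.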
Related
HtmlAgilityPack SelectNodes InnerText returns placeholder text not actual value
In my app I want to scrape a web page to extract the values I am interested in (ShopData is the HtmlNodeCollection). My C# code looks like this:
var ShopName = ShopData.SelectNodes(".//div[@class='shop-name']");
This returns null. If I try this instead, it returns the node:
var ShopName1 = ShopData.SelectNodes(".//div[contains(@class, 'shop cf')]");
Why does .//div[@class='shop-name'] not work? If I do ShopData.SelectNodes(".//div[contains(@class, 'shop cf')]").ToList()[0]; the InnerText is empty, while ShopData.SelectNodes(".//div[@class='price']").ToList()[0].InnerText returns text normally. What is the difference between these two calls? My web page looks like this:
<li class="cf card js-product-card"> <div class="shop cf"> <div class="shop-logo js-shop-logo"> <img class="fade-in" src="//a.scdn.gr/ds/shops/logos/2870/mid_20181210114648_f305ba08.jpeg" data-src="//a.scdn.gr/ds/shops/logos/2870/mid_20181210114648_f305ba08.jpeg" alt="Electroholic"> </div> <i class="icon tooltip-parent js-tooltip-handler trustmark" data-trigger="toggle" data-type="string" data-theme="light" data-content="Το κατάστημα διαθέτει πιστοποίηση GRECA Trustmark που σημαίνει ότι έχει δεσμευτεί να εργαστεί σύμφωνα με τον Eλληνικό και Ευρωπαϊκό (αντίστοιχα) Κώδικα Ηλεκτρονικού Εμπορίου, διασφαλίζοντας δεοντολογικά πρότυπα στην ψηφιακή αγορά.<div>Περισσότερες πληροφορίες στην <a href='http://www.greekecommerce.gr/' target='_blank'>ιστοσελίδα του GRECA.</a></div>" data-placement="left"> <span>GRECA Trustmark</span> </i> <div class="shop-name">Electroholic</div> </div> <div class="description"> <div class="item"> <h3> <a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link content-placeholder" data-type="title" href="/products/show/32755241"> Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις</a> </h3> <p class="availability"><span class="availability">Παράδοση έως 30 ημέρες</span></p> </div> </div> <div class="price"> <div class=""> 
<div class="price-content"><a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link product-link content-placeholder" data-type="net_price" href="/products/show/32755241">358,00 €</a><span class="extra-cost cf"><em>+ 9,00 €</em> <span>Μεταφορικά</span></span><span class="extra-cost cf"><em>+ 2,00 €</em> <span>Αντικαταβολή</span></span><span class="final-price"><a title="Πολυμηχάνημα Epson EcoTank ITS L6170 WiFi ink - έως 60 δόσεις" rel="nofollow" class="js-product-link content-placeholder" data-type="final_price" href="/products/show/32755241">369,00 €</a></span></div> </div> </div> <div class="shop-details react-expander-bottom js-product-uservoice"><span class="payment-options"><i class="icon tooltip-parent js-tooltip-handler trustmark" data-trigger="toggle" data-type="string" data-theme="light" data-content="Το κατάστημα διαθέτει πιστοποίηση GRECA Trustmark που σημαίνει ότι έχει δεσμευτεί να εργαστεί σύμφωνα με τον Eλληνικό και Ευρωπαϊκό (αντίστοιχα) Κώδικα Ηλεκτρονικού Εμπορίου, διασφαλίζοντας δεοντολογικά πρότυπα στην ψηφιακή αγορά.<div>Περισσότερες πληροφορίες στην <a href='http://www.greekecommerce.gr/' target='_blank'>ιστοσελίδα του GRECA.</a></div>" data-placement="auto vertical"><span>GRECA Trustmark</span></i> </span> <div class="shop-expander-tabs"> <button class="shop-tab js-shop-tab icon "> <div class="rating-with-count react-component"> <a class="rating stars" title="3,9 αστέρια από 1493 χρήστες" href="#reviews"> <div class="rating-wrapper"> <div class="actual-rating blue" itemprop="" style="width: 78%;">1493</div><span itemprop="">3,9</span></div> </a> <div class="reviews-count blue"> <a title="1493 αξιολογήσεις χρηστών" href="#reviews">1493</a></div> </div> </button> <button class="shop-tab js-shop-tab icon location-tab multi-shops "> <span>Περιστέρι, Αττική</span></button> </div> <div class="shop-info-object js-shop-info-expander "> </div> </div> </li>
OK, I figured out what is going on. The web page uses AJAX calls, and that's why I cannot see the values.
KnockoutJS elements not rendered once loaded via Jquery Ajax function
I have loaded a sidebar over ajax however this html uses knockoutJS to render completely. I am wondering how to execute the KnockoutJs portions of this code. The content below is loaded via jQuery ajax function and contains a number of knockout elements as well as some X Magento Init type scripts: <div class=\"block filter\" id=\"layered-filter-block\" data-mage-init='{\"collapsible\":{\"openedState\": \"active\", \"collapsible\": true, \"active\": false, \"collateral\": { \"openedState\": \"filter-active\", \"element\": \"body\" } }}'> <div class=\"block-title filter-title\" data-count=\"0\"> <strong data-role=\"title\">Shop By<\/strong> <\/div> <div class=\"block-content filter-content\"> <strong role=\"heading\" aria-level=\"2\" class=\"block-subtitle filter-subtitle\">Shopping Options<\/strong> <div class=\"filter-options\" id=\"narrow-by-list\" data-role=\"content\" data-mage-init='{\"accordion\":{\"openedState\": \"active\", \"collapsible\": true, \"active\": [0,1,2], \"multipleCollapsible\": true}}'> <div data-role=\"collapsible\" class=\"filter-options-item\"> <div data-role=\"title\" class=\"filter-options-title\">Category<\/div> <div data-role=\"content\" class=\"filter-options-content\">\n<ol class=\"items\"> <li class=\"item\"> <a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=143&q=ice+machine\">Front of House <span class=\"count\">2<span class=\"filter-count-label\">items<\/span><\/span><\/a> <\/li> <li class=\"item\"> <a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=182&q=ice+machine\">Bar Supplies <span class=\"count\">4<span class=\"filter-count-label\">items<\/span><\/span><\/a> <\/li> <li class=\"item\"> <a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=257&q=ice+machine\">Catering Equipment<span class=\"count\">111<span class=\"filter-count-label\">\n items <\/span><\/span>\n <\/a>\n <\/li>\n <li class=\"item\">\n <a 
href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=342&q=ice+machine\">\n Warewashing <span class=\"count\">\n 3 <span class=\"filter-count-label\">\n items <\/span><\/span>\n <\/a>\n <\/li>\n <li class=\"item\">\n <a href=\"http:\/\/www.domain.com\/catalogsearch\/result\/index\/?ajax=1&cat=521&q=ice+machine\">\n Catering Equipment Offers <span class=\"count\">\n 1 <span class=\"filter-count-label\">\n item <\/span><\/span>\n <\/a>\n <\/li>\ <\/ol> <\/div>\n <\/div>\n <div data-role=\"collapsible\" class=\"filter-options-item\"> <div data-role=\"title\" class=\"filter-options-title\">Brand<\/div>\n <div data-role=\"content\" class=\"filter-options-content\"> <div data-bind=\"scope: 'brandFilter'\"> <!-- ko template: getTemplate() --> <!-- \/ko --> <\/div> <script type=\"text\/x-magento-init\"> {\"*\" : {\"Magento_Ui\/js\/core\/app\": {\"components\": {\"brandFilter\": {\"component\":\"Smile_ElasticsuiteCatalog\\\/js\\\/attribute-filter\",\"maxSize\":10,\"displayProductCount\":true,\"hasMoreItems\":true,\"ajaxLoadUrl\":\"http:\\\/\\\/www.domain.com\\\/catalog\\\/navigation_filter\\\/ajax\\\/?ajax=1&filterName=brand&q=ice+machine\",\"items\":[{\"label\":\"Scotsman\",\"count\":41,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Scotsman&q=ice+machine\",\"is_selected\":false},{\"label\":\"Hoshizaki\",\"count\":15,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Hoshizaki&q=ice+machine\",\"is_selected\":false},{\"label\":\"Ice-o-matic\",\"count\":12,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Ice-o-matic&q=ice+machine\",\"is_selected\":false},{\"label\":\"Blue 
Ice\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Blue+Ice&q=ice+machine\",\"is_selected\":false},{\"label\":\"Graupel\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Graupel&q=ice+machine\",\"is_selected\":false},{\"label\":\"Nemox\",\"count\":7,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Nemox&q=ice+machine\",\"is_selected\":false},{\"label\":\"Manitowoc\",\"count\":6,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Manitowoc&q=ice+machine\",\"is_selected\":false},{\"label\":\"Polar Refrigeration\",\"count\":5,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Polar+Refrigeration&q=ice+machine\",\"is_selected\":false},{\"label\":\"Longo & Co\",\"count\":4,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Longo+%26+Co&q=ice+machine\",\"is_selected\":false},{\"label\":\"Beaumont\",\"count\":3,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&brand=Beaumont&q=ice+machine\",\"is_selected\":false}]}}}}}\n<\/script>\n\n<\/div>\n <\/div>\n <div data-role=\"collapsible\" class=\"filter-options-item\">\n <div data-role=\"title\" class=\"filter-options-title\">Power<\/div>\n <div data-role=\"content\" class=\"filter-options-content\"><div data-bind=\"scope: 'power_ddFilter'\">\n <!-- ko template: getTemplate() --> <!-- \/ko -->\n<\/div>\n\n<script type=\"text\/x-magento-init\">\n {\"*\" : {\"Magento_Ui\/js\/core\/app\": {\"components\": {\"power_ddFilter\": {\"component\":\"Smile_ElasticsuiteCatalog\\\/js\\\/attribute-filter\",\"maxSize\":10,\"displayProductCount\":true,\"hasMoreItems\":false,\"ajaxLoadUrl\":\"http:\\\/\\\/www.domain.com\\\/catalog\\\/navigation_filter\\\/ajax\\\/?ajax=1&filterName=power_dd&q=ice+machine\",\"items\":[{\"label\":\"13 Amp 
(Plug)\",\"count\":111,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&power_dd=13+Amp+%28Plug%29&q=ice+machine\",\"is_selected\":false},{\"label\":\"1 Phase (Hard Wired)\",\"count\":2,\"url\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&power_dd=1+Phase+%28Hard+Wired%29&q=ice+machine\",\"is_selected\":false}]}}}}}\n<\/script>\n\n<\/div>\n <\/div>\n <div data-role=\"collapsible\" class=\"filter-options-item\">\n <div data-role=\"title\" class=\"filter-options-title\">Price<\/div>\n <div data-role=\"content\" class=\"filter-options-content\"><div class=\"smile-es-range-slider\" data-role=\"range-price-slider-price\">\n <div data-role=\"from-label\"><\/div>\n <div data-role=\"to-label\"><\/div>\n <div data-role=\"slider-bar\"><\/div>\n <div class=\"actions-toolbar\">\n <div data-role=\"message-box\"><\/div>\n <div class=\"actions-primary\">\n <a class=\"action primary small\" data-role=\"apply-range\">\n <span>OK<\/span>\n <\/a>\n <\/div>\n <\/div>\n<\/div>\n\n<script type=\"text\/x-magento-init\">\n { \"[data-role=range-price-slider-price]\" : { \"rangeSlider\" : 
{\"minValue\":1,\"maxValue\":6091,\"currentValue\":{\"from\":1,\"to\":6091},\"fieldFormat\":{\"pattern\":\"\\u00a3%s\",\"precision\":2,\"requiredPrecision\":2,\"decimalSymbol\":\".\",\"groupSymbol\":\",\",\"groupLength\":3,\"integerRequired\":false},\"intervals\":[{\"value\":1,\"count\":1},{\"value\":2,\"count\":1},{\"value\":3,\"count\":1},{\"value\":40,\"count\":1},{\"value\":60,\"count\":1},{\"value\":64,\"count\":1},{\"value\":150,\"count\":1},{\"value\":179,\"count\":1},{\"value\":190,\"count\":1},{\"value\":242,\"count\":1},{\"value\":291,\"count\":1},{\"value\":325,\"count\":1},{\"value\":355,\"count\":2},{\"value\":395,\"count\":1},{\"value\":465,\"count\":1},{\"value\":472,\"count\":1},{\"value\":515,\"count\":1},{\"value\":520,\"count\":1},{\"value\":535,\"count\":1},{\"value\":555,\"count\":1},{\"value\":577,\"count\":1},{\"value\":585,\"count\":1},{\"value\":599,\"count\":1},{\"value\":605,\"count\":2},{\"value\":615,\"count\":1},{\"value\":640,\"count\":1},{\"value\":658,\"count\":1},{\"value\":685,\"count\":1},{\"value\":705,\"count\":1},{\"value\":730,\"count\":1},{\"value\":745,\"count\":2},{\"value\":785,\"count\":1},{\"value\":805,\"count\":1},{\"value\":830,\"count\":1},{\"value\":895,\"count\":2},{\"value\":925,\"count\":1},{\"value\":965,\"count\":1},{\"value\":970,\"count\":1},{\"value\":990,\"count\":2},{\"value\":1030,\"count\":1},{\"value\":1065,\"count\":1},{\"value\":1080,\"count\":1},{\"value\":1085,\"count\":1},{\"value\":1095,\"count\":1},{\"value\":1105,\"count\":1},{\"value\":1130,\"count\":1},{\"value\":1155,\"count\":1},{\"value\":1225,\"count\":1},{\"value\":1235,\"count\":1},{\"value\":1240,\"count\":1},{\"value\":1259,\"count\":1},{\"value\":1310,\"count\":1},{\"value\":1360,\"count\":1},{\"value\":1365,\"count\":1},{\"value\":1450,\"count\":1},{\"value\":1485,\"count\":1},{\"value\":1495,\"count\":1},{\"value\":1510,\"count\":1},{\"value\":1580,\"count\":2},{\"value\":1605,\"count\":2},{\"value\":1685,\"count\":1},{\"value\":171
0,\"count\":1},{\"value\":1779,\"count\":1},{\"value\":1785,\"count\":1},{\"value\":1865,\"count\":1},{\"value\":1870,\"count\":1},{\"value\":1885,\"count\":1},{\"value\":1890,\"count\":1},{\"value\":1970,\"count\":1},{\"value\":1995,\"count\":1},{\"value\":2000,\"count\":1},{\"value\":2050,\"count\":1},{\"value\":2130,\"count\":1},{\"value\":2199,\"count\":1},{\"value\":2220,\"count\":1},{\"value\":2345,\"count\":1},{\"value\":2350,\"count\":1},{\"value\":2360,\"count\":1},{\"value\":2405,\"count\":1},{\"value\":2415,\"count\":1},{\"value\":2445,\"count\":1},{\"value\":2450,\"count\":2},{\"value\":2480,\"count\":1},{\"value\":2500,\"count\":1},{\"value\":2530,\"count\":1},{\"value\":2565,\"count\":1},{\"value\":2570,\"count\":1},{\"value\":2595,\"count\":1},{\"value\":2695,\"count\":1},{\"value\":2730,\"count\":1},{\"value\":2825,\"count\":1},{\"value\":2850,\"count\":1},{\"value\":2950,\"count\":1},{\"value\":2995,\"count\":1},{\"value\":3010,\"count\":1},{\"value\":3025,\"count\":1},{\"value\":3145,\"count\":1},{\"value\":3205,\"count\":1},{\"value\":3295,\"count\":1},{\"value\":3300,\"count\":1},{\"value\":3485,\"count\":1},{\"value\":3495,\"count\":1},{\"value\":3580,\"count\":1},{\"value\":4015,\"count\":1},{\"value\":4075,\"count\":1},{\"value\":4305,\"count\":1},{\"value\":4310,\"count\":1},{\"value\":4595,\"count\":1},{\"value\":4620,\"count\":1},{\"value\":5250,\"count\":1},{\"value\":5355,\"count\":1},{\"value\":6090,\"count\":1}],\"urlTemplate\":\"http:\\\/\\\/www.domain.com\\\/catalogsearch\\\/result\\\/index\\\/?ajax=1&price=<%- from %>-<%- to %>&q=ice+machine\",\"messageTemplates\":{\"displayCount\":\"<%- count %> products\",\"displayEmpty\":\"No products in the selected range.\"},\"rate\":1} } } <\/script> <\/div> <\/div> <\/div> <\/div> <\/div> These are then added to a block on my page via html jQuery method: $(sidebarBlock).html(this.filters); Looking at the DOM I cannot actually see the scripts however they are there in response when reviewing 
with console.log(). Similarly, the following shows that the scripts are present:
$(sidebar).find("script").each(function() { console.log("found a script"); });
I have tried to use .trigger('contentUpdated'), like this:
document.getElementById("layered-filter-block").innerHTML = this.filters;
$(sidebarBlock).trigger('contentUpdated');
and:
$(sidebarBlock).html(this.filters);
$(sidebarBlock).trigger('contentUpdated');
and by reapplying the Knockout bindings:
ko.cleanNode($('#layered-filter-block'));
ko.applyBindings($('#layered-filter-block'));
The last attempt throws an error about bindings already being applied, even though I call cleanNode first to unbind.
This fixed the issue for me:
$(sidebarBlock).applyBindings();
https://codeblog.experius.nl/magento-2-uicomponent-reinit-ajax-reload/
Remove leading or trailing whitespace of string except html tag in ruby
I want to remove leading and trailing whitespace from a string, except inside HTML tags. Example:
html = "<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>"
My way of doing this:
html.gsub(/>\s{1,8}</, "><").gsub(/>\s{1,8}/, ">").gsub(/\s{1,8}</, "<")
How the blanks are removed depends on the pattern. Is there a better way to write it?
Use positive lookarounds:
html = %| <a class=\"c-.......| # your line goes here
html.gsub(/(?<=>)\s+|\s+(?=<)/, '')
The above means “remove all whitespace after '>' or before '<'.”
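As a cross-check of what that pattern does, here is the same lookaround expression run through Python's re module. Using Python instead of Ruby here is purely for illustration, and the sample string is a shortened, made-up version of the markup in the question.

```python
import re

# Shortened stand-in for the markup in the question.
sample = '<div class="side">\n   <div class="title">\n      Yahoo  News\n   </div>\n</div>'

# (?<=>)\s+  matches whitespace that directly follows '>'
# \s+(?=<)   matches whitespace that directly precedes '<'
stripped = re.sub(r'(?<=>)\s+|\s+(?=<)', '', sample)

print(stripped)
```

Whitespace inside a tag (between the tag name and an attribute) and inside the text itself is left alone, because it is neither directly after '>' nor directly before '<'.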
Try Following: html = "<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>" -> html.squeeze(' ').strip output: "<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\"> <div class=\"c-flex\"> <div class=\"c-grid__quotation--main\"> <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" /> </div> <div class=\"c-grid__quotation--side\"> <div class=\"c-grid__quotation--side-title text--b\"> Yahoo </div> <div class=\"c-grid__quotation--side-description\"> News, email and search are just the beginning. Discover more every day. Find your yodel. </div> <div class=\"c-grid__quotation--side-url\"> www.yahoo.com </div> </div> </div> </div></a>"
How can I create a custom xpath query?
This is my HTML file data:
<article class='course-box'> <div class='row-fluid'> <div class='span2'> <div class='course-cover' style='width: 100%'> <img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'> </div> </div> <div class='span10'> <h2 class='coursetitle'> <a href='https://novoed.com/hc'>Hippocrates Challenge</a> </h2> <figure class='pricetag'> Free </figure> <div class='timeline independent-text'> <div class='timeline inline-block'> Starting Spring 2014 </div> </div> By Jill Helms <div class='university' style='margin-top:0px; font-style:normal;'> Stanford University </div> </div> </div> <div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'"> <div class='span2'> <div class='course-cover'> <img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'> </div> </div> <div class='span10'> <h2 class='coursetitle' style='margin-top: 10px'> <a href='https://novoed.com/hc'> Hippocrates Challenge </a> </h2> <p class='description' style='width: 70%'> Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while... </p> <div style='margin-right: 10px'> <a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'> Sign Up </a> </div> </div> </div>
From the code above I need to fetch the values for the following tags/classes: coursetitle, coursetitle href link, pricetag, timeline inline-block, university, description, and instructor name. coursetitle appears in two places, but I need it only once. Likewise, the instructor name is not wrapped in any specific tag that I could use to fetch it.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list is different:
len(courseTitle) = 40 (twice the expected count because of the repetition)
len(courseDetailLink) = 40 (twice the expected count because of the repetition)
len(courseInstructorName) = 160 (unwanted text comes through because this value has no specific tag)
len(coursePriceType) = 20 (correct count, no repetition)
len(courseShortSummary) = 20 (correct count, no repetition)
len(courseUniversity) = 20 (correct count, no repetition)
Kindly modify my XPath queries to solve my problem. Thanks in advance.
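You don't need re:test. It performs a regex search, so re:test(@class, "row-fluid") also matches class='hovered row-fluid', which is where the duplicates come from. Simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right node, you can run further relative queries from it.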
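Building on that, one way to get exactly one record per course is to loop over each course box and run only relative queries inside it. The following is a sketch under assumptions: it uses lxml directly rather than Scrapy's selector, the markup is a trimmed copy of the question's HTML wrapped in html/body, and the helper first and the dictionary keys are hypothetical names chosen for illustration.

```python
from lxml import html

# Trimmed-down version of the markup from the question.
PAGE = """
<html><body>
<article class='course-box'>
  <div class='row-fluid'>
    <div class='span10'>
      <h2 class='coursetitle'><a href='https://novoed.com/hc'>Hippocrates Challenge</a></h2>
      <figure class='pricetag'> Free </figure>
      <div class='timeline independent-text'>
        <div class='timeline inline-block'> Starting Spring 2014 </div>
      </div>
      By Jill Helms
      <div class='university'> Stanford University </div>
    </div>
  </div>
  <div class='hovered row-fluid'>
    <div class='span10'>
      <h2 class='coursetitle'><a href='https://novoed.com/hc'> Hippocrates Challenge </a></h2>
      <p class='description'> Hippocrates Challenge 2014 is a course ... </p>
    </div>
  </div>
</article>
</body></html>
"""


def first(node, query):
    """Return the first result of a relative XPath query, stripped, or None."""
    results = node.xpath(query)
    return results[0].strip() if results else None


doc = html.fromstring(PAGE)
courses = []
for box in doc.xpath('//article[@class="course-box"]'):
    # The non-hovered block: its class attribute is exactly "row-fluid".
    main = box.xpath('./div[@class="row-fluid"]/div[@class="span10"]')[0]

    # The instructor has no tag of its own: take the bare text node
    # directly under span10 that contains "By ", then drop the prefix.
    by_line = first(main, './text()[contains(., "By ")]')
    instructor = by_line[len("By "):] if by_line else None

    courses.append({
        'title': first(main, './h2[@class="coursetitle"]/a/text()'),
        'link': first(main, './h2[@class="coursetitle"]/a/@href'),
        'price': first(main, './figure[@class="pricetag"]/text()'),
        'timeline': first(main, './/div[@class="timeline inline-block"]/text()'),
        'university': first(main, './div[@class="university"]/text()'),
        'instructor': instructor,
        'summary': first(box, './/p[@class="description"]/text()'),
    })

print(courses)
```

Because every field is read relative to one box, the lists can no longer drift out of step, and a missing field shows up as None instead of shifting the remaining values.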
using variables in HtmlXPathSelectors
I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. That is, if I have some long string, it would be nice to store it in a variable long_string and then select with it: hxs.select('//div[@id=long_string]'). I'm sure Scrapy supports this and I just can't figure it out, as it wouldn't make sense to always have to hard-code the string.
Update: For the sample text below, I want to extract the div where id="footer":
<div id="footer"> <div id="footer-menu"> <div class="region-footer-menu"> <div id="block-menu-menu-footer-menu" class="block-menu"> <div class="content"> <ul class="menu"> <li class="first leaf">FAQs</li> <li class="leaf">Media</li> <li class="leaf">Partners</li> <li class="last leaf active-trail">Jobs</li> </ul> </div> </div> <div id="block-block-52" class="block block-block"> <div class="content"> <p>SUPPORT</p> </div> </div> </div> </div> </div>
We initialize hxs = HtmlXPathSelector(response) for all the segments below.
The following code selects only the first div:
hxs.select('//div[@id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[@id="foot"+"er"]')
Both of the code segments below select nothing and give no errors:
long_string = "foot"
hxs.select('//div[@id=concat(long_string,"er")]')
hxs.select('//div[@id=long_string]')
I would like to be able to use either of the last two approaches and get the desired result.
Assuming + works for string concatenation in Python, this should work:
hxs.select('//div[@id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
Have you tried this?
hxs.select('//div[@id="' + long_string_variable + '"]')
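The reason the bare name fails silently is that XPath cannot see Python variables: inside the expression, long_string is read as the name of a child element, and "+" is numeric addition rather than string concatenation. The sketch below contrasts the failing forms with two working ones. It uses lxml directly, which binds XPath variables from keyword arguments; recent Scrapy/parsel selectors accept similar keyword arguments, but check your version. The page and the variable name wanted_id are made up for illustration.

```python
from lxml import html

PAGE = """
<html><body>
  <div id="header"><p>TOP</p></div>
  <div id="footer">
    <div id="footer-menu"><p>SUPPORT</p></div>
  </div>
</body></html>
"""

doc = html.fromstring(PAGE)
long_string = "footer"

# 1. A bare name inside the expression is an XPath node test, not a Python
#    variable: this compares @id with child elements named <long_string>.
bare_name = doc.xpath('//div[@id=long_string]')

# 2. In XPath 1.0, "+" converts both strings to numbers (NaN here),
#    so the comparison is never true.
plus_sign = doc.xpath('//div[@id="foot"+"er"]')

# 3. Build the expression in Python (fine for trusted values without quotes).
formatted = doc.xpath('//div[@id="{}"]'.format(long_string))

# 4. Pass the value as an XPath variable; lxml binds $wanted_id from the
#    keyword argument, so no manual quoting is needed.
bound = doc.xpath('//div[@id=$wanted_id]', wanted_id=long_string)

print(len(bare_name), len(plus_sign))
print([d.get('id') for d in formatted], [d.get('id') for d in bound])
```

The variable form is the safer choice when the value may contain quote characters, since nothing is spliced into the expression text.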