using variables in HtmlXPathSelectors - xpath

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. For example, if I have some long string, it would be nice to store it in a variable long_string and then select on it: hxs.select('//div[@id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out, as it wouldn't make sense to always have to hard-code the string.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[@id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[@id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[@id=concat(long_string,"er")]')
hxs.select('//div[@id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.

The XPath expression is just a Python string, so you can build it with + before passing it to select():
hxs.select('//div[@id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
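As a minimal sketch of why the interpolation matters: inside the quotes, long_string is never seen as a Python variable. XPath reads a bare name as a child-element test, so //div[@id=long_string] compares @id against a nonexistent <long_string> child and matches nothing. The example below uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own; the trimmed markup and the helper name select_div_by_id are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the footer markup from the question.
HTML = """<html><body>
<div id="footer">
  <div id="footer-menu"><p>SUPPORT</p></div>
</div>
</body></html>"""


def select_div_by_id(root, div_id):
    """Hypothetical helper: build the expression in Python, then query."""
    # Python substitutes the value here, before the XPath engine runs.
    expr = ".//div[@id='{}']".format(div_id)
    return root.findall(expr)


root = ET.fromstring(HTML)
long_string = "foot"
matches = select_div_by_id(root, long_string + "er")
ids = [m.get("id") for m in matches]
print(ids)
```

If the value can contain a quote character, plain interpolation produces a broken expression, so this sketch is only suitable for trusted values.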

Have you tried this?
hxs.select('//div[@id="' + long_string_variable + '"]')

Related

Issues with preceding sibling/parent/ancestor

<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
Note: there are over 20 productHolder classes on the same page.
I am able to get the price data. How can I use parent or preceding-sibling to get the href?
I use the following code to get price:
rawPrice = response.xpath("//*[contains(text(),'$')]/text()")[counter].extract()
I've spent 2 hours trying to use preceding-sibling, parent, and even changing the code to use other values, but I run into issues elsewhere.
Any help is appreciated, cheers!
Were you looking for something like:
from io import StringIO
from lxml import etree
html = """
<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
"""
root = etree.parse(StringIO(html), etree.HTMLParser())
print(root.xpath('//*[contains(text(),"$")]/../@href')[0])
Result:
https://ap.com
Of course you can easily build from this:
item = root.xpath('//*[contains(text(),"$")]')
print(item[0].text)
print(item[0].xpath('../@href')[0])
Result:
$980
https://ap.com
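With more than 20 productHolder blocks on the page, it may be simpler to start from each link and read the price beneath it, rather than finding the price first and climbing back up. The following is a rough sketch of that alternative using the standard library's xml.etree.ElementTree (not Scrapy or lxml). The markup is a cleaned-up version of the sample, and the second product with its URL is hypothetical.

```python
import xml.etree.ElementTree as ET

# Cleaned-up sample; the second productHolder is made up for illustration.
PAGE = """<body>
  <div class="productHolder">
    <a href="https://ap.com" class="tea-time-with-ap">
      <div class="aptime-8" dataInfo="name">Hammer</div>
      <div class="aptime-9" dataInfo="price">$980</div>
    </a>
  </div>
  <div class="productHolder">
    <a href="https://ap.com/saw" class="tea-time-with-ap">
      <div class="aptime-8" dataInfo="name">Saw</div>
      <div class="aptime-9" dataInfo="price">$45</div>
    </a>
  </div>
</body>"""

root = ET.fromstring(PAGE)
products = []
# Visit each link once, then look downward for the price it contains.
for link in root.findall(".//div[@class='productHolder']/a"):
    price = link.findtext("div[@dataInfo='price']")
    products.append({"price": price, "href": link.get("href")})

print(products)
```

Because each price is read relative to its own link, the pairs stay aligned without a separate counter.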

Parsing through response created with XPath

Using Scrapy, I want to extract some data from a well-formed HTML site. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in that list using XPath.
All XPath's have been tested using XPather. I have tested the issue using a local file that contains the webpage, same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
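Your expression starts with /, which makes it absolute: it is evaluated from the document root, not from products[0]. Start it with . to make it relative to the selected element:
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"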
products[0].xpath(XPATH_PRODUCT_NAME).extract()
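To illustrate the relative lookup on every item rather than only products[0], here is a small sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it is self-contained; the markup is shortened from the question, and the second item ("Another product") is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the second item is hypothetical.
PAGE = """<div id="product-list-146620"><div>
  <div class="col"><div class="item"><div class="item__text-container">
    <div class="item__name">
      <a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
    </div>
  </div></div></div>
  <div class="col"><div class="item"><div class="item__text-container">
    <div class="item__name">
      <a class="item__name-link" href="/c.aspx?ref_gr=9802">Another product</a>
    </div>
  </div></div></div>
</div></div>"""

root = ET.fromstring(PAGE)
# Same shape as the question's first query: list container, then /div/div.
products = root.findall("./div/div")

# The leading "." keeps each search inside the current product element.
names = [p.findtext(".//div[@class='item__name']/a") for p in products]
print(names)
```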

Select visible xpath in list

I am trying to get the error message off of a page from a site. The list contains several possible errors, so I can't check by id, but I do know that the one with display:list-item is the one I want. This is my rule, but it doesn't seem to work. What is wrong with it? What I want returned is the error text in the element.
//*[@id='errors']/ul/li[contains(@style,'display:list-item')]
Example dom elements:
<div id="errors" class="some class" style="display: block;">
<div class="some other class"></div>
<div class="some other class 2">
<span class="displayError">Please correct the errors listed in red below:</span>
<ul>
<li style="display:none;" id="invalidId">Enter a valid id</li>
<li style="display:list-item;" id="genericError">Something bad happened</li>
<li style="display:none;" id="somethingBlah" ............ </li>
....
</ul>
</div>
The correct XPath should be:
//*[@id='errors']//ul/li[contains(@style,'display:list-item')]
After //*[#id='errors'] you need an extra /, because <ul> is not directly beneath it. Using // again scans all underlying elements for <ul>.
If you can avoid //, do so: an explicit path is faster and consumes fewer resources.
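One thing worth checking, as an assumption about the live page rather than something shown in the question: the style attribute may be written with a space ("display: list-item;"), in which case contains(@style,'display:list-item') would not match. The sketch below does the filtering in Python with the standard library's xml.etree.ElementTree and strips spaces before comparing. The helper name visible_errors and the spaced style value are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample from the question, closed properly; the spaced style is assumed.
PAGE = """<body>
<div id="errors" class="some class" style="display: block;">
  <div class="some other class"></div>
  <div class="some other class 2">
    <span class="displayError">Please correct the errors listed in red below:</span>
    <ul>
      <li style="display:none;" id="invalidId">Enter a valid id</li>
      <li style="display: list-item;" id="genericError">Something bad happened</li>
    </ul>
  </div>
</div>
</body>"""


def visible_errors(container):
    """Hypothetical helper: text of every <li> styled as a list item."""
    found = []
    # iter() walks all descendants, so <ul> need not be a direct child.
    for li in container.iter("li"):
        style = li.get("style", "").replace(" ", "")
        if "display:list-item" in style:
            found.append(li.text)
    return found


root = ET.fromstring(PAGE)
errors = visible_errors(root.find(".//*[@id='errors']"))
print(errors)
```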

Creating a Hovercard with jQuery Hovercard and custom data attributes

I need some help about creating more HoverCards with this plugin (download). I just created a code demo on JSFiddle. Do you have any recommendations about this?
JavaScript:
$('.babe-hover').hovercard({
detailsHTML: $(this).attr('data-control').html(),
width:278
});
HTML:
<ul class="demo">
<li>
<span class="babe-hover" data-control="control-01">William Johnson</span>
<div id="control-01" style="display: none;">
<p class="s-desc">Address: 64 Newman Street.</p>
<ul class="s-stats">
<li>Tweets<br><span class="s-count">1337</span></li>
</ul>
</div>
</li>
<li>
<span class="babe-hover" data-control="control-02">Hanson Thomas</span>
<div id="control-02" style="display: none;">
<p class="s-desc">Address: 64 Newman Street.</p>
<ul class="s-stats">
<li>Tweets<br><span class="s-count">1337</span></li>
</ul>
</div>
</li>
It is working now, with some help from the author:
$('.babe-hover').each(function(){
var $this = $(this),
myControlId = $this.attr('data-control'),
htmlForHovercard = $('#'+ myControlId).html();
$this.hovercard({
detailsHTML: htmlForHovercard,
width:278
});
});
Anyway, thanks @egasimus for your suggestion :)
I suppose you're asking: why doesn't this work?
You're trying to call the .html() method on what $(this).attr('data-control') returns. $(this).attr('data-control'), however, only returns a string, and you need to obtain the corresponding element in order to use .html(). The following code works for me:
$("#" + $(this).attr('data-control')).html()
I.e., "select the element whose id equals this element's data-control attribute, and call .html() on it."

Accessing a div element in an array of li elements

I am trying to access a div in an li array
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<div class="news-item">
</li>
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<div class="news-item">
<div class="image">
<div class="details with-image">
<h2>
<p class="standfirst">The best two-seat </p>
<div class="meta">
<div class="pub-date">26 April 2012</div>
<div class="topic-bar clearfix">
<div class="topic car_review">review</div>
</div>
</div>
</div>
</div>
</li>
I am trying to access the <div class="topic car_review"> element and get its text.
The reason I am specifically using that text is that, depending on what the text is it would enter specific steps.
Code that I am using is
@topic = @browser.li(:class => /views-row-#{x}/).div(:class,'news-item').div(:class,'details').div(:class,'meta').div(:class,/topic /).text
The script was working fine before and suddenly it has stopped working and is just not able to get the div(:class,'news-item').
The error message I get is
unable to locate element, using {:class=>"news-item", :tag_name=>"div"} (Watir::Exception::UnknownObjectException)
I tried div(:class => /news-/), but it is still not able to find that element.
I am really stuck!!!
I assume that when you are doing li(:class => /views-row-#{x}/), the x means you are iterating over all rows? If so, then your script will fail on the row-2 since it does not contain the news-item div (resulting in the error that you see).
If there is only one of these 'topic car_review' div tags, you can just do:
@topic = @browser.div(:class, 'topic car_review')
Update - Iterating over each LI:
If you need to iterate over each LI, then you could do:
@browser.lis.each do |li|
@topic = li.div(:class, 'topic car_review').text
end
