How can I create a custom xpath query? - xpath

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
from above the code i need to fetch the following tag class values.
coursetitle
coursetitle href link
pircetag
timeline inline-block
uinversity
description
instructor name
but coursetitle is available in two places but i need only once. same instructor name does not contain any specifi tag to fecth.
my xpath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(#class, "row-fluid")]/div[re:test(#class, "span10")]/h2[re:test(#class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(#class, "row-fluid")]/div[re:test(#class, "span10")]/h2[re:test(#class, "coursetitle")]/a/#href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(#class, "row-fluid")]/div[re:test(#class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(#class, "row-fluid")]/div[re:test(#class, "span10")]/figure[re:test(#class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(#class, "hovered row-fluid")]/div[re:test(#class, "span10")]/p[re:test(#class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(#class, "row-fluid")]/div[re:test(#class, "span10")]/div[re:test(#class, "university")]/text()').extract()
but the number of values in each list variable is difference:
len(courseTitle) = 40 (two times because of repetition)
len(courseDetailLink) = 40 (two times because of repetition)
len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
len(coursePriceType) = 20 (correct count no repetition)
len(courseShortSummary)= 20 (correct count no repetition)
len(courseUniversity) = 20 (correct count no repetition)
kindly modify my xpath query to solve my problem. thanks in advance..

you dont need that re:test, simply do:
>>> s = sel.xpath('//div[#class="row-fluid"]/div[#class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[#class="coursetitle"]/a/#href').extract()
[u'https://novoed.com/hc']
also note that once s is set on the right place you can just continue from it.

Related

Thymeleaf switch block returns incorrect value

I have a switch block in my thymeleaf page where I show an image depending on the reputation score of the user:
<h1>
<span th:text="#{user.reputation} + ${reputation}">Reputation</span>
</h1>
<div th:if="${reputation lt 0}">
<img th:src="#{/css/img/troll.png}"/>
</div>
<div th:if="${reputation} == 0">
<img th:src="#{/css/img/smeagol.jpg}"/>
</div>
<div th:if="${reputation gt 0} and ${reputation le 5}">
<img th:src="#{/css/img/samwise.png}"/>
</div>
<div th:if="${reputation gt 5} and ${reputation le 15}">
<img th:src="#{/css/img/frodo.png}"/>
</div>
<div th:if="${reputation gt 15}">
<img th:src="#{/css/img/gandalf.jpg}"/>
</div>
This statement always returns smeagol (so reputation 0), eventhough the reputation of this user is 7: example
EDIT:
I was wrong, the image showing was a rogue line:
<!--<img th:src="#{/css/img/smeagol.jpg}"/>-->
but I commented it out. Now there is no image showing.
EDIT2:
changed my comparators (see original post) and now I get the following error:
The value of attribute "th:case" associated with an element type "div" must not contain the '<' character.
EDIT3:
Works now, updated original post to working code
According to the documentation, Thymeleaf's switch statement works just like Java's - and the example suggests the same.
In other words: you cannot do
<th:block th:switch="${reputation}">
<div th:case="${reputation} < 0">
[...]
but would need to do
<th:block th:switch="${reputation}">
<div th:case="0">
[...]
which is not what you want.
Instead, you will have to use th:if, i.e. something like this:
<div th:if="${reputation} < 0">
<img th:src="#{/css/img/troll.png}"/>
</div>
Change
<div th:case="0">
<img th:src="#{/css/img/smeagol.jpg}"/>
</div>
to
<div th:case="${reputation == 0}">
<img th:src="#{/css/img/smeagol.jpg}"/>
</div>

Scrapy and XPath issue with nested Xpaths

I'm trying to read Amazon products into scrapy.
Starting from a random category using this XPath:
products = Selector(response).xpath('//div[#class="s-item-container"]')
for product in products:
item = AmzItem()
item['title'] = product.xpath('//a[#class="s-access-detail-page"]/#title').extract()[0]
item['url'] = product.xpath('//a[#class="s-access-detail-page"]/#href').extract()[0]
yield item
('//div[#class="s-item-container"]') returns all the divs with the products on one category page - that's correct.
Now, how would I get the link to the product?
// stands for where ever in the code
a with the #class should select the right class
But I get a:
item['title'] = product.xpath('//a[#class="s-access-detail-page"]/#title').extract()[0]
exceptions.IndexError: list index out of range
So my list matching this XPath must be empty - but I don't understand why?
EDIT:
The HTML would look like that:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>
It should be:
# ------------- The dot makes the query relative to product
product.xpath('.//a[#class="s-access-detail-page"]/#title')
//a[#class="s-access-detail-page"] requires to be exactly class="s-access-detail-page", because xpath works with string but not with meaning :) When you have "multi class ", use contains function
//a[contains(concat(' ', #class, ' '), " s-access-detail-page ")]/#title

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.
You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
url = "http://" + URI.decode(a.last).tr("'", '').strip + "/" + a.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
obviously my answer is not foolproof. You can make it better with checks for what if scan returns []?
This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end
To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w|\.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

How can I get several similar tags data with HtmlAgilityPack?

Before explaining, I am using VB.net and HtmlAgilityPack.
I have the below html, all three sections have the same format. I am using htmlagilitypack to extract the data from the Title and Date. My code extracts the title correctly but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(#class='title')]")
Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(#class, 'date ')]")
Dim newsdate As String = date1.InnerText
MessageBox.Show(h4.InnerText)
MessageBox.Show(newsdate)
Next
I thought being in each h4, I get its associated date accordingly...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-pre+view="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
You need to anchor your second XPath to look 'below' the h4:
Dim date1 As HtmlNode = h4.Parent.SelectSingleNode(".//span[starts-with(#class, 'date ')]")
^^^^^^^^^ ^^^
The .// tells Xpath to look under the node the Xpath is executed on. Thus by calling SelectSingleNode on the h4.Parent you get the date below the parent div tag of the h4.

using variables in HtmlXPathSelectors

I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. Meaning if I have some long string it would be nice to store it in a variable long_string and then select for it: hxs.select('\\div[#id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out as it wouldn't make sense for you to always have to hard-code the string in.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[#id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[#id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[#id=concat(long_string,"er")]')
hxs.select('//div[#id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.
Assuming + works for string concatenation in Scrapy, this should work:
hxs.select('//div[#id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
have you tried?
hxs.select('\\div[#id="' + long_string_variable + '"]')

Resources