Using Scrapy and xpath to extract text - xpath

I'm trying to use xpath to extract text from the following html:
<p class="event-meta" xpath="1">Nanizanka / <span itemprop="genre">Akcija</span>,
<span itemprop="partOfSeason" itemscope="" itemtype="http://schema.org/CreativeWorkSeason">
<span itemprop="seasonNumber">8</span>. sezona,
</span>
<span itemprop="episodeNumber">9</span>. del,
United states of America
<br><i class="fa fa-clock-o"></i> <span>
51
</span> min |
IMDB: 7,3 |
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span class="hollow-star">★</span>
<span class="hollow-star">★</span>
<span class="hollow-star">★</span>
</p>
I'm having a problem extracting United States of America and IMDB score, since they don't have any tags?
I can't get beyond
response.xpath("//div[@class='row nogutter article']/div[@class='col-10']/main/article/p[@class='event-meta']//text()").extract()
since I need just the country and IMDB score as two separate items.
Your help will be much appreciated.

score = response.xpath('//text()[contains(., "IMDB:")]').re_first(r'IMDB:\s*(\S+)')
country = response.xpath('//p[@class="event-meta"]/span[@itemprop="episodeNumber"]/following-sibling::text()[1]').re_first(r',\s*(.+)')
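The idea can be checked outside Scrapy: both values live in untagged text, so the country is reachable as the text trailing the episodeNumber span, and the score can be pulled from the flattened text with a regular expression. The sketch below is only an illustration using Python's standard library on a trimmed, well-formed copy of the markup (the `<br>` is written as `<br/>`, and the names `SNIPPET`, `country` and `score` are chosen here, not taken from the question).

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup (<br> written as <br/>).
SNIPPET = """<p class="event-meta">Nanizanka / <span itemprop="genre">Akcija</span>,
<span itemprop="partOfSeason">
<span itemprop="seasonNumber">8</span>. sezona,
</span>
<span itemprop="episodeNumber">9</span>. del,
United states of America
<br/><i class="fa fa-clock-o"></i> <span>
51
</span> min |
IMDB: 7,3 |
<span>★</span>
<span class="hollow-star">★</span>
</p>"""

p = ET.fromstring(SNIPPET)

# The country is part of the text that trails the episodeNumber span:
# ". del,\nUnited states of America\n" -> keep what follows the first comma.
episode = p.find("span[@itemprop='episodeNumber']")
country = episode.tail.split(",", 1)[1].strip()

# The score is untagged as well, so search the flattened text of <p>.
full_text = "".join(p.itertext())
score = re.search(r"IMDB:\s*(\S+)", full_text).group(1)

print(country)
print(score)
```

In Scrapy itself the same two steps map onto an XPath that selects the text node plus `.re_first()` for the final cut.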

Related

Proper xpath Syntax for Extracting Two Text Values

I am trying to scrape a web page for NAME OF COMPANY and CITY AND STATE OF COMPANY shown below.
I have an xpath code snippet that identifies both text elements at the same time:
//span[starts-with(@class,"text-align")]/text()[2]
This xpath snippet pulls the first text value (COMPANY NAME). How do I get the second text element (CITY,STATE)?
A snip of the web page code looks like this:
<div>
<ul class="pv-top-card-v3--experience-list">
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="position_see_more" data-ember-action="" data-ember-action-172="172">
<img src="https://media.licdn.com/dms/image/C4E0BAQFhA8h46hvabA/company-logo_100_100/0?e=1582761600&v=beta&t=VAeZqaGu3Lu6Ol_n5kiiI74FSRuSOZA1ggAI5qTVRjE" id="ember173" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember174" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE NAME OF A COMPANY
<!----></span>
</a>
</li>
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="education_see_more" data-ember-action="" data-ember-action-176="176">
<img src="https://media.licdn.com/dms/image/C560BAQEr2uQX-x2EwQ/company-logo_100_100/0?e=1582761600&v=beta&t=aDbYLUDMvlS4DpwOLjOaQj3Dj60C_cYLC5UUvGoyld0" id="ember177" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember178" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE CITY AND STATE OF COMPANY
<!----></span>
</a>
</li>
</ul>
</div>
The xpath string is picking up the two span elements using class. I can't use the span id attributes because they are dynamic and change with each page (one page per company).
Can someone advise how I extract the desired text?
Thanks.
Point to the li level and select the second list item:
//ul/li[2]/a/span[starts-with(@class,"text-align")]
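A minimal sketch of the positional approach, assuming a trimmed and well-formed copy of the markup (the img tags and the dynamic ids are left out, and the variable names are made up for illustration). It uses only Python's standard library, whose limited XPath support has no starts-with(), so the span is selected purely by its position under each li:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the list from the question; ids and images omitted.
SNIPPET = """<div>
<ul class="pv-top-card-v3--experience-list">
<li>
<a class="pv-top-card-v3--experience-list-item" href="#">
<span class="text-align-left ml2 t-14"> THIS IS THE NAME OF A COMPANY
</span>
</a>
</li>
<li>
<a class="pv-top-card-v3--experience-list-item" href="#">
<span class="text-align-left ml2 t-14"> THIS IS THE CITY AND STATE OF COMPANY
</span>
</a>
</li>
</ul>
</div>"""

root = ET.fromstring(SNIPPET)

# li[1] and li[2] are 1-based positions among the li children of the ul.
company = root.find("ul/li[1]/a/span").text.strip()
location = root.find("ul/li[2]/a/span").text.strip()

print(company)
print(location)
```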

How to use preceding sibling in xpath

I want to click the Accept button. Here is the HTML:
<div class="friend-request no-pad ng-scope" ng-if="notifications.friendInvites.length > 0">
<p class="rem-head mzero small">
<div class="reminder-lst lst-box ng-scope" ng-repeat="friendInvite in notifications.friendInvites | limitTo:limit">
<span class="img-frame img-circle">
<span class="pull-left rem-detail-a">
<a class="pull-left rem-detail-a pzero" href="friend#/friends/friendprofile/b6c70e4f-bfe1-440d-836c-2e8fdc88540e">
<span class="frndact pull-right">
<a class="ignore" ng-click="ignoreNotification(friendInvite, 'friend')" href="javascript:void(0)">
<a class="accept" ng-click="acceptNotification(friendInvite, 'friend')" href="javascript:void(0)">
<i class="fa fa-lg fa-check-circle green"></i>
</a>
I have tried the XPath below, but it is not working. Can anyone please help me?
@FindBy(xpath=".//a[ng-click='acceptNotification(friendInvite, 'friend')']/preceding-sibling::i[@css='.fa.fa-lg.fa-check-circle.green']")
Thanks in advance
Assuming that you are looking for the a tag with class accept, you can try
//i[@class="fa fa-lg fa-check-circle green"]/preceding-sibling::a[@class="accept"]
or
//i[@class="fa fa-lg fa-check-circle green"]/preceding-sibling::a[@ng-click="acceptNotification(friendInvite, 'friend')"]
A couple of things:
As TT noted, your XPath was missing the @ for the attribute selector.
The sample you posted is not well-formed XML; expect trouble with XPath if you don't have an XHTML-compliant source.
If you use the second example inside another expression, remember to escape either the " or the ' quotes.
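One caveat worth checking against the real page: in the markup as posted, the i element appears to be nested inside the a.accept link rather than sitting next to it, in which case a sibling axis would not reach the link. The sketch below, written in Python with the standard library on a trimmed and closed-off copy of the snippet (an assumption, since the original is not well-formed), shows two alternatives: selecting the link by its class directly, or stepping from the icon up to its parent with `..`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the posted markup with every tag closed.
SNIPPET = """<span class="frndact pull-right">
<a class="ignore" href="javascript:void(0)"></a>
<a class="accept" href="javascript:void(0)">
<i class="fa fa-lg fa-check-circle green"></i>
</a>
</span>"""

root = ET.fromstring(SNIPPET)

# Option 1: address the link directly by its class attribute.
by_class = root.find(".//a[@class='accept']")

# Option 2: find the icon, then step up to its parent element.
by_icon = root.find(".//i[@class='fa fa-lg fa-check-circle green']/..")

print(by_class.get("class"))
print(by_class is by_icon)
```

The equivalent full XPath 1.0 expressions would be `//a[@class="accept"]` and `//i[@class="fa fa-lg fa-check-circle green"]/parent::a`.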

import.io selecting css class with xpath that contain certain value/character

<div id="mDetails">
<span class="textLabel">Bar Number:</span>
<p class="profileText">YYYYYYYYYYYYYYYYYYYY</p>
<span class="textLabel">Address:</span>
<p class="profileText">YYYYYYYYYYYYYYYYYYY<br>YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY<br>United States</p>
<span class="textLabel">Phone:</span>
<p class="profileText">123465798</p>
<span class="textLabel">Fax:</span>
<p class="profileText">987654321</p>
<span class="textLabel">Email:</span>
<p class="profileText">regina@rbr3.com</p>
<span class="textLabel">County:</span>
<p class="profileText">YYYYYYYYYYYYYYY</p>
<span class="textLabel">Circuit:</span>
<p class="profileText">YYYYYYYYYY</p>
<span class="textLabel">Admitted:</span>
<p class="profileText">00/00/0000</p>
<span class="textLabel">History:</span>
<p class="profileText">YYYYYYYYYYYYYYYYY</p>
I'm trying to select the email only if it is available. When I use //*[@class="profileText"] it returns everything with this class; I want it to return a value only when @ is present in it.
With the input XML adjusted so that both <br> become <br/> (otherwise it's not valid XML), the following XPath selects all p elements that have the class profileText and contain @:
//p[@class='profileText'][contains(.,'@')]
returns
<p class="profileText">regina@rbr3.com</p>
In case you only want to get the value, you can use string():
string(//p[@class='profileText'][contains(.,'@')])
returns
regina@rbr3.com
Note that string() returns only the value of the first match, while the first XPath, which returns the p elements, returns all matches.
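The difference between "all matches" and "first match" can be sketched in Python. This is only an illustration on a shortened copy of the markup: the standard library's ElementTree has no contains() function, so the "text contains an at sign" test is done in Python after selecting by class, and the names `matches` and `first` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the profile block.
SNIPPET = """<div id="mDetails">
<span class="textLabel">Phone:</span>
<p class="profileText">123465798</p>
<span class="textLabel">Email:</span>
<p class="profileText">regina@rbr3.com</p>
<span class="textLabel">County:</span>
<p class="profileText">YYYYYYYYYYYYYYY</p>
</div>"""

root = ET.fromstring(SNIPPET)

# Select by class with XPath, then keep only nodes whose text contains "@".
matches = [
    "".join(p.itertext()).strip()
    for p in root.findall(".//p[@class='profileText']")
    if "@" in "".join(p.itertext())
]

# Like string(): only the first match, or an empty string if there is none.
first = matches[0] if matches else ""

print(matches)
print(first)
```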

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above I need to fetch the values of the following tags and classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
However, coursetitle appears in two places and I need it only once. Also, the instructor name is not wrapped in any specific tag that I could fetch.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (two times because of repetition)
len(courseDetailLink) = 40 (two times because of repetition)
len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
len(coursePriceType) = 20 (correct count no repetition)
len(courseShortSummary)= 20 (correct count no repetition)
len(courseUniversity) = 20 (correct count no repetition)
Kindly modify my XPath queries to solve my problem. Thanks in advance.
You don't need re:test here; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right node, you can keep querying relative to it.
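To illustrate why the exact match helps: `@class="row-fluid"` does not match the `hovered row-fluid` block, so the title is picked up once, and the untagged instructor name can be read as the text trailing the outer timeline div. The sketch below assumes a trimmed, well-formed copy of one course box (unclosed img tags removed, description shortened to its first sentence) and uses Python's standard library instead of Scrapy; `parse_course` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one course box; images and inline styles omitted.
SNIPPET = """<article class='course-box'>
<div class='row-fluid'>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid'>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine.
</p>
</div>
</div>
</article>"""


def parse_course(article):
    """Extract one record per article so all fields stay aligned."""
    # Exact class match: skips the 'hovered row-fluid' duplicate of the title.
    box = article.find("div[@class='row-fluid']/div[@class='span10']")
    link = box.find("h2[@class='coursetitle']/a")
    # The instructor is bare text that follows the outer timeline div.
    byline = box.find("div[@class='timeline independent-text']").tail.strip()
    hovered = article.find("div[@class='hovered row-fluid']/div[@class='span10']")
    return {
        "title": link.text.strip(),
        "link": link.get("href"),
        "price": box.find("figure[@class='pricetag']").text.strip(),
        "timeline": box.find(".//div[@class='timeline inline-block']").text.strip(),
        "instructor": byline[3:] if byline.startswith("By ") else byline,
        "university": box.find("div[@class='university']").text.strip(),
        "description": hovered.find("p[@class='description']").text.strip(),
    }


course = parse_course(ET.fromstring(SNIPPET))
print(course)
```

Building one record per article, rather than six page-wide lists, also avoids the mismatched list lengths described in the question.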

How can I get several similar tags data with HtmlAgilityPack?

Before explaining, I am using VB.net and HtmlAgilityPack.
I have the below html, all three sections have the same format. I am using htmlagilitypack to extract the data from the Title and Date. My code extracts the title correctly but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(@class='title')]")
Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'date ')]")
Dim newsdate As String = date1.InnerText
MessageBox.Show(h4.InnerText)
MessageBox.Show(newsdate)
Next
I thought being in each h4, I get its associated date accordingly...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-preview="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
You need to anchor your second XPath to look 'below' the h4:
Dim date1 As HtmlNode = h4.Parent.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
                        ^^^^^^^^^                   ^^^
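The .// tells XPath to search beneath the node the expression is executed on, whereas a leading // always starts from the document root and therefore keeps returning the first date. By calling SelectSingleNode on h4.Parent, you get the date below the parent div of that h4.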
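The same anchoring effect can be reproduced in Python. This is a sketch with the standard library rather than HtmlAgilityPack: the `<root>` wrapper is added only so the three blocks parse as one document, the class is matched exactly because ElementTree has no starts-with(), and `unanchored` deliberately repeats the mistake from the question for comparison.

```python
import xml.etree.ElementTree as ET

# The three article blocks, wrapped in a hypothetical <root> element.
SNIPPET = """<root>
<div class="article-header" data-itemid="920729">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
</div>
</div>
<div class="article-header" data-itemid="920720">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
</div>
</div>
<div class="article-header" data-itemid="920719">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
</div>
</div>
</root>"""

root = ET.fromstring(SNIPPET)
DATE = ".//span[@class='date timestamp']"

anchored = []    # date searched below the current article block
unanchored = []  # date searched from the document root every time

for header in root.findall("div[@class='article-header']"):
    title = header.find("h4[@class='title']").text
    anchored.append((title, "".join(header.find(DATE).itertext())))
    unanchored.append((title, "".join(root.find(DATE).itertext())))

print(anchored)
print(unanchored)
```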

Resources