xpath how to extract the href attribute value based on the content?

xpath how to extract the href attribute value based on the content? - xpath

<a class="tab-link"> hoho </a>
<a class="tab-link" href="screener.ashx?v=110&s=ta_topgainers&r=41"><b>next</b></a>
<a class="tab-link"> hohaao </a>
I'm working on the scrapy
how can I grap the href attribute value based on the content "next" in xpath?

Check the b element's text inside this way:
//a[b = "next"]/#href
Demo from the Scrapy Shell:
$ cat index.html
<div>
<a class="tab-link"> hoho </a>
<a class="tab-link" href="screener.ashx?v=110&s=ta_topgainers&r=41"><b>next</b></a>
<a class="tab-link"> hohaao </a>
</div>
$ scrapy shell file:////path/to/index.html
>>> response.xpath('//a[b = "next"]/#href').extract_first()
u'screener.ashx?v=110&s=ta_topgainers&r=41'

Related

Proper xpath Syntax for Extracting Two Text Values

I am trying to scrape a web page for NAME OF COMPANY and CITY AND STATE OF COMPANY shown below.
I have an xpath code snippet that identifies both text elements at the same time:
// span[starts-with(#class,"text-align")]/text()[2]
This xpath snippet pulls the first text value (COMPANY NAME). How do I get the second text element (CITY,STATE)?
A snip of the web page code looks like this:
<div>
<ul class="pv-top-card-v3--experience-list">
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="position_see_more" data-ember-action="" data-ember-action-172="172">
<img src="https://media.licdn.com/dms/image/C4E0BAQFhA8h46hvabA/company-logo_100_100/0?e=1582761600&v=beta&t=VAeZqaGu3Lu6Ol_n5kiiI74FSRuSOZA1ggAI5qTVRjE" id="ember173" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember174" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE NAME OF A COMPANY
<!----></span>
</a>
</li>
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="education_see_more" data-ember-action="" data-ember-action-176="176">
<img src="https://media.licdn.com/dms/image/C560BAQEr2uQX-x2EwQ/company-logo_100_100/0?e=1582761600&v=beta&t=aDbYLUDMvlS4DpwOLjOaQj3Dj60C_cYLC5UUvGoyld0" id="ember177" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember178" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE CITY AND STATE OF COMPANY
<!----></span>
</a>
</li>
</ul>
</div>
The xpath string is picking up the two span elements using class. I can't use the span id attributes because they are dynamic and change with each page (one page per company).
Can someone advise how I extract the desired text?
Thanks.

point to the li level.
//ul/li[2]/a/span[starts-with(#class,"text-align")]

Extract text inside anchor tag using xpath

I am trying to ascertain how many pages are there for any search result on a site so that i can scrape data for all the pages using lxml and xpath.
There is a pagination tab with the following structure:
Page: 1 2 3 ... 7 next
the html content for the same being something like
<ul class="ulclass">
<li></li>
<li>
<span> You are on the first page</span>
"1"
</li>
<li>
<a href="link to second page">
<span></span>
"2"
</a>
</li>
<li>
</li>
...
<li>
<a href="link to last page">
<span></span>
"7"
</a>
</li>
My approach is to extract the page numbers 1,2,3,7 so that i can repeat the web scraping 7 times for every page 'cause otherwise it just scrapes the first result of the page.
I have written the following xpath, but it doesnot return correct page numbers.
xpath('//ul[#class="ulclass"]/li/a/text())

If I expand your example to form this,
<ul class="ulclass">
<li><span>You are on the first page</span>"1"</li>
<li><span></span>"2"</li>
<li><span></span>"3"</li>
<li><span></span>"4"</li>
<li><span></span>"5"</li>
<li><span></span>"6"</li>
<li><span></span>"7"</li>
</ul>
then using scrapy in Python I can get this:
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('..//ul[#class="ulclass"]/li/a/text()').extract()
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']

xPath expression is not finding text inside HTML

I have following XML:
<div>
<ul>
<li>
<a>
Logout 1
</a>
</li>
<li>
<a>
Logout 2
</a>
</li>
<li>
<a>
Logout 3
</a>
</li>
<li>
<a>
Logout 4
</a>
</li>
</ul>
</div>
And I want to check if a a tag with the text Logout 4exists. I do this with the following expression:
/div/ul/li/a[text() = 'Logout 4']
Which doesnt seem to work, anyone can tell me what I am doing wrong?
I am testing my xPath on this site btw: http://www.xpathtester.com/xpath

Your XPath didn't return any result because the inner text of the a element has leading and trailing spaces, which you can clear using normalize-space() :
/div/ul/li/a[normalize-space() = 'Logout 4']
demo
or, if you really want to evaluate only the first child text node within a :
/div/ul/li/a[normalize-space(text()) = 'Logout 4']

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.

You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
url = "http://" + URI.decode(a.last).tr("'", '').strip + "/" + a.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
obviously my answer is not foolproof. You can make it better with checks for what if scan returns []?

This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end

To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w|\.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Scraping a web page by using XPath

I'm scraping some web pages in order to get some information. I'm using Scrapy and XPath language.
This is an example of page I would get. In the page there are many of this li element
<li ckIgnore="false" codmod="3857" ccar="A" area="NEW" versArea="NEW" shorturl="1" modurl="/auto">
<article>
<img width="210" height="158" src="" alt="" modello=>
<img src="" alt="logo" class="logo-listing" width="38">
<div class="hgroup">
<a href="">
<h5>ABARTH</h5>
<h3>500 cabrio</h3>
</a>
</div>
</article>
</li>
I'm using this syntax to get all the divs which have hgroup class. Unfortunately when I try to print print out models variable this is empty.
def parse(self, response):
sel = Selector(response)
models = sel.xpath("//div[#class='hgroup']/a")

It's possible that what scrapy "sees" is different than what you see on your browser. Try using scrapy shell "http://example.com" and check the response.body if what your looking for is there.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

xpath how to extract the href attribute value based on the content? - xpath

<a class="tab-link"> hoho </a> <a class="tab-link" href="screener.ashx?v=110&s=ta_topgainers&r=41"><b>next</b></a> <a class="tab-link"> hohaao </a> I'm working on the scrapy how can I grap the href attribute value based on the content "next" in xpath?

Related

Proper xpath Syntax for Extracting Two Text Values

Extract text inside anchor tag using xpath

xPath expression is not finding text inside HTML

How can I extract URLs from HTML content with a Ruby regexp?

Scraping a web page by using XPath

Categories

Resources