Extract data from a div that has no class using XPath

Code
<div id="content">
<div class="sample">sample text</div>
<div class="datebar">
<span style="float:right">some text1</span>
<b>some text2</b>
</div>
<p>paragraph 1</p>
<p>paragraph 2</p>
</div>
I want to get the data in the <p> tags, that is, the content that comes after <div class="datebar">.

//div[@id="content"]/p/text()
This would achieve what you're asking for with your provided sample.
Update
If you only want those <p> elements that come after <div class="datebar">, the following should work:
//div[@id = 'content']/p[preceding-sibling::div[@class='datebar']]/text()
Another Update - For Kirill
Here's a sample of HTML which has an extra <p> before <div class="datebar"> and xpath expressions tested using python.
Obviously, the solution depends on what the full input HTML is and what the OP wants to extract, neither of which is clear at the moment.
>>> from lxml import etree
>>> doc = etree.HTML("""
... <div id="content">
... <div class="sample">sample text</div>
... <p>paragraph 1</p>
... <div class="datebar">
... <span style="float:right">some text1</span>
... <b>some text2</b>
... </div>
... <p>paragraph 2</p>
... <p>paragraph 3</p>
... </div>""")
>>> # My first suggestion
... doc.xpath("//div[@id='content']/p/text()")
['paragraph 1', 'paragraph 2', 'paragraph 3']
>>> # Kirill's solution
... doc.xpath("//div[@id = 'content' and div[@class = 'datebar']]/p/text()")
['paragraph 1', 'paragraph 2', 'paragraph 3']
>>> # My response to Kirill
... doc.xpath("//div[@id = 'content']/p[preceding-sibling::div[@class='datebar']]/text()")
['paragraph 2', 'paragraph 3']
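Kirill's expression, //div[@id = 'content' and div[@class = 'datebar']]/p/text(), does not select
only those p whose parent div has @id = 'content' and which have a preceding div with @class = 'datebar',
as stated in his comments. Its predicate only requires that the content div has a datebar child somewhere, so every p child is returned.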

//div[@id = 'content' and div[@class = 'datebar']]/p/text()

Related

Scrapy and XPath issue with nested Xpaths

I'm trying to read Amazon products into scrapy.
Starting from a random category using this XPath:
products = Selector(response).xpath('//div[@class="s-item-container"]')
for product in products:
    item = AmzItem()
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0]
    yield item
The expression '//div[@class="s-item-container"]' returns all the product divs on one category page, which is correct.
Now, how would I get the link to the product? My understanding is:
// matches anywhere in the document
a with the @class predicate should select the link with the right class
But I get:
item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
exceptions.IndexError: list index out of range
So the list matching this XPath must be empty, but I don't understand why.
EDIT:
The HTML would look like that:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>
It should be:
# ------------- The dot makes the query relative to product
product.xpath('.//a[@class="s-access-detail-page"]/@title')
//a[@class="s-access-detail-page"] requires the attribute to be exactly class="s-access-detail-page", because XPath compares the attribute as a plain string and knows nothing about space-separated class names :) When an element has several classes, use the contains function:
//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
From the above content I want to take
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
extract the strings "f6a1ok3n4d4p" and "site2.com", and combine them into
http://site2.com/f6a1ok3n4d4p
and likewise for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need this to be done with a Ruby regex.
You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# join the first and last Array elements to generate the url
# (URI.decode was removed in Ruby 3.0; URI.decode_www_form_component is used here instead)
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
Obviously my answer is not foolproof. You can make it more robust by checking for the case where scan returns [].
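A minimal sketch of that check, assuming the link always has the three-argument javascript:show(...) shape from the question. The helper name show_link_to_url is hypothetical and not part of the original answer; it returns nil instead of raising when the pattern is missing.

```ruby
require 'uri'

# Hypothetical helper: builds the URL from a javascript:show(...) link,
# or returns nil when the pattern is absent or has too few arguments.
def show_link_to_url(str)
  match = str.match(/javascript:show\((.*?)\)/)
  return nil unless match

  # Split the argument list, then strip percent-encoding, quotes and spaces.
  args = match[1].split(',').map do |arg|
    URI.decode_www_form_component(arg).delete("'").strip
  end
  return nil if args.size < 3

  "http://#{args.last}/#{args.first}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
p show_link_to_url("javascript:report(3191274,%203,%202164691,%201)")
```

The first call prints http://site2.com/f6a1ok3n4d4p and the second prints nil, so the caller can skip links that are not javascript:show links.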
This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
  puts "http://#{match[1]}/#{match[0]}"
end
To match just your case:
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# inside a character class "|" is a literal character and "." needs no escape
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the above code I need to fetch the values of the following tags/classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
The coursetitle appears in two places, but I need it only once. Also, the instructor name is not wrapped in any specific tag that I could fetch it by.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (doubled because of the repetition)
len(courseDetailLink) = 40 (doubled because of the repetition)
len(courseInstructorName) = 160 (unwanted text is included because there is no specific tag for this value)
len(coursePriceType) = 20 (correct count, no repetition)
len(courseShortSummary) = 20 (correct count, no repetition)
len(courseUniversity) = 20 (correct count, no repetition)
Kindly modify my XPath queries to solve my problem. Thanks in advance.
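You don't need re:test; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Because @class="row-fluid" is an exact string match, it skips the div with class "hovered row-fluid", which is where the duplicate title comes from.
Also note that once s is set to the right place, you can continue querying relative to it.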

How to get the ID of an element using Watir when its child contains the string I search for

<div class="wrapper">
<div id="minHeightBlock" style="min-height: 430px;">
<div class="borderbox"><div class="standaloneBox">
<div class="sysHeaderContainer clearfix"> … </div>
<div class="notesForGuests"> … </div>
<div class="filterBox clearfix"> … </div>
<div class="resListHeader"> … </div>
<div id="corporaContainer" class="fullList">
<div id="c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8" class="resItem clearfix">
<div class="resTitle">
<span id="filter-empty" class="statBall statFile empty" title="Status: Empty corpus"></span>
<span class="theText">
12321 corpora
</span>
</div>
<div class="resType"> … </div>
<div class="resSize"> … </div>
<div class="resPermission private"> … </div>
<div class="resDomain"> … </div>
<div class="resDescr"> … </div>
<div class="resDetails clearfix" style="display:none;"> … </div>
</div>
<div id="c-b8c0faba-e662-4998-836f-0ee58009b7fa" class="resItem clearfix"> … </div>
<div id="c-9d02b887-4835-4606-ad4b-775b39af9f48" class="resItem clearfix"> … </div>
<div id="c-021d3ba1-db03-4c4e-81a5-294737eb5b54" class="resItem clearfix"> … </div>
This is the code of the webpage I'm trying to script using Watir. All I know is what kind of span text the element should contain. I have many of these elements and I need to collect all of their ID values so I can use them in further actions.
I have commented the places in the above code to show what I know and what I need to get.
So far I have tried this code:
@b.div(:id, "pageHeader").link(:text, "Corpora").click
sleep 5
@b.div(:id, "corporaContainer").spans(:text => /TestAuto\s.*/).each do |span|
  puts span.parent.attribute_value("id")
end
But it produces no output. Maybe I'm doing something wrong. Please help me crack this nut.
Your attempt was close. The problem is that span.parent only goes up to the <div class="resTitle">. You need to go up one more parent:
@b.div(:id, "corporaContainer").spans(:text => /corpora/).each do |span|
  puts span.parent.parent.attribute_value("id")
end
(Note that I changed the text in the locator of the spans since TestAuto\s.* did not match the sample html.)
Alternatively, I sometimes find it better to find the divs that contain the span. This way you do not have to worry about the number of parents changing:
p @b.divs(:class => 'resItem')
  .find_all { |div| div.span(:text => /corpora/).exists? }
  .collect { |div| div.id }
#=> ["c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8"]
Below is a working example. Note that there are 2 important things:
The list of results is loaded asynchronously. Therefore you need to wait for the list to finish loading before capturing the results. sleep(5) might work, but you are better off using an actual wait method (since it seems to take longer than 5 seconds).
Make sure the search text actually exists on the page. In the below example, there is no "12321 corpora" title that was mentioned in the sample html.
Example:
require 'watir-webdriver'
# Title to search for:
title_text = /UniAdm/
# Go to the Corpora page:
@b = Watir::Browser.new :ff
@b.goto "https://www.letsmt.eu/Corpora.aspx"
# Wait for the results to load:
container = @b.div(:id, "corporaContainer")
container.div(:class => 'resItem').wait_until_present
# Find the matching ids:
p container.divs(:class => 'resItem')
  .find_all { |div| div.span(:class => 'theText', :text => title_text).exists? }
  .collect { |div| div.id }
#=> ["c-87ee80a9-e529-48b2-92be-bc8d76375478", "c-f139e781-4789-41f9-82e8-914e0e3eff81", "c-e17641d2-9364-4e87-9047-ba35580dc32f"]

In Ruby, mytext.include?(">Model number<") returns false

In Ruby, mytext.include?(">Model number<") returns false,
but mytext.include?("Model number") returns true.
What is wrong with the first condition?
mytext contains the string "Model number" between ">" and "<".
This is relevant HTML:
<div class="bucket"> <div class="h1"><strong>Product Specifications</strong></div> <div class="content"> <div class="tsSectionHeader">Product Information</div> <div class="tsTable"> <div class="tsRow"><span class="tsLabel">Model number</span><span>516C</span></div> <div class="tsRow"><span class="tsLabel">Maximum weight recommendation</span><span>35 Pounds</span></div> <div class="tsRow"><span class="tsLabel">Material Type</span><span>Wood</span></div> </div> </div> </div>
You have to learn some HTML. > and < are part of the span tags: <span></span>.
This is where the text appears:
<span class="tsLabel">Model number</span>
So the span's text is just "Model number", without the angle brackets. You can get the text using Watir with this:
browser.span(:class => "tsLabel").text
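A small sketch of why the two checks differ, assuming mytext holds the rendered page text (for example from browser.text) rather than the raw HTML; the question does not say which. The tag-stripping regex below is only a rough stand-in for what the browser renders, not a way to parse real pages.

```ruby
html = '<div class="tsRow"><span class="tsLabel">Model number</span><span>516C</span></div>'

# Rough stand-in for the rendered text: replace every tag with a space,
# then collapse the repeated spaces.
text = html.gsub(/<[^>]+>/, ' ').squeeze(' ').strip

puts html.include?('>Model number<')  # the raw HTML still has the tag brackets
puts text.include?('>Model number<')  # the rendered text does not
puts text.include?('Model number')
```

This prints true, false and true: the angle brackets belong to the markup, so they are only present when you search the HTML source.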
