How to select the first 4 children (with the same attributes) of a parent node that has more than 3 children with the same attributes as the ones I want to select?
I have tried this expression, but it's not working:
//div[@class='content-page minified']/*[self::h2 or p[:2]]
My code:
<div class = "content-page minified">
<h2> Company Description </h2>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<h2> Mission Description</h2>
<p>...</p>
<ul>...</ul>
<p>...</p>
<h2>Requirements</h2>
<ul>...</ul>
<a class="my child class" href="#">...</a>
<div class="my second child class" href="#">...</div>
</div>
I expect to select both <h2> and first 3 <p> tags.
To get the first two <p> tags after the first <h2> tag, using lxml, try:
import lxml.html

# Sample markup from the question
html = """
<div class = "content-page minified">
<h2> Company Description </h2>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<h2> Mission Description</h2>
<p>...</p>
<ul>...</ul>
<p>...</p>
<h2>Requirements</h2>
<ul>...</ul>
<a class="my child class" href="#">...</a>
<div class="my second child class" href="#">...</div>
</div>
"""
# Parse the fragment; the outer <div> becomes the root element
tree = lxml.html.fromstring(html)
# Start at the first <h2> child and keep its first two following <p> siblings
h = tree.xpath("//div[@class='content-page minified']/h2[1]/following-sibling::p[position()<3]")
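If lxml is not available, the same idea can be sketched with the standard library. This assumes the question means "the first <h2> plus the first three <p> siblings after it". ElementTree only supports a subset of XPath (no following-sibling axis), so the selection is done in plain Python. The names SAMPLE and heading_and_paragraphs are made up for this illustration, and the markup is tidied so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, tidied so it is well-formed XML (assumption).
SAMPLE = """
<div class="content-page minified">
  <h2> Company Description </h2>
  <p>...</p>
  <p>...</p>
  <p>...</p>
  <p>...</p>
  <h2> Mission Description</h2>
  <p>...</p>
  <ul>...</ul>
  <p>...</p>
  <h2>Requirements</h2>
  <ul>...</ul>
  <a class="my child class" href="#">...</a>
  <div class="my second child class" href="#">...</div>
</div>
"""


def heading_and_paragraphs(container, count):
    """Return the first <h2> child plus the first `count` <p> siblings after it."""
    children = list(container)
    for index, child in enumerate(children):
        if child.tag == "h2":
            # Equivalent in spirit to h2[1]/following-sibling::p[position() <= count]
            following_p = [c for c in children[index + 1:] if c.tag == "p"]
            return [child] + following_p[:count]
    return []  # no <h2> child at all


div = ET.fromstring(SAMPLE)
selected = heading_and_paragraphs(div, 3)
tags = [el.tag for el in selected]
print(tags)  # ['h2', 'p', 'p', 'p']
```

Changing the count argument to 2 gives the first <h2> plus the two <p> elements that the lxml expression targets.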
Related
I want to select all article elements that don't contain a span element with class status and whose nested a element has an href attribute containing the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But after reading here and trying to select the parent article element like so, I get "data=0":
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this:
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the right expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>
This XPath expression will do the job:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanation:
Translating your requirements into XPath syntax:
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
"and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
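To check the two conditions outside of Scrapy, here is a minimal standard-library sketch. ElementTree has no not() or contains(), so each predicate is written as a Python test. The <root> wrapper and the names SAMPLE and rentable_articles are assumptions added for this illustration only.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <root> element
# so that it parses as one well-formed XML document.
SAMPLE = """
<root>
  <article class="car"><div><div class="container">
    <a href="/34625030/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container">
    <a href="/34625230/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container">
    <a href="/12325230/buy.html"></a></div></div></article>
  <article class="car"><div><div class="container">
    <a href="/34632230/rent.html"></a></div></div>
    <span class="status">Rented</span></article>
</root>
"""


def rentable_articles(root):
    """Return <article> elements with a rent.html link and no status span."""
    matches = []
    for article in root.iter("article"):
        # Mirrors [not(.//span[@class='status'])]
        has_status = article.find(".//span[@class='status']") is not None
        # Mirrors [.//a[contains(@href, 'rent.html')]]
        has_rent_link = any(
            "rent.html" in a.get("href", "") for a in article.iter("a")
        )
        if has_rent_link and not has_status:
            matches.append(article)
    return matches


root = ET.fromstring(SAMPLE)
hrefs = [article.find(".//a").get("href") for article in rentable_articles(root)]
print(hrefs)  # ['/34625030/rent.html', '/34625230/rent.html']
```

The buy.html article fails the link test, and the last article is dropped because it contains the status span.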
The question is simple, but I don't have enough practice with this case :)
How can I get the price text from every div with class "block", given that only the blocks containing an item_promo element are needed?
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">123</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">456</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">789</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">222</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">333</div>
</div>
You could use this XPath:
//div[@class='block']/*[@class='item_promo']/following-sibling::div[@class='item_price']/text()
Inside each block, it finds the element whose class attribute is item_promo, moves to its following sibling div whose class is item_price, and grabs the text.
This XPath,
//div[div/@class='item_promo']/div[@class='item_price']
will return those item_price class div elements with sibling item_promo class div elements:
<div class="item_price">123</div>
<div class="item_price">456</div>
<div class="item_price">789</div>
This will work regardless of label/price order.
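The order-independent variant can be sketched with the standard library as follows. This is only an illustration: the <root> wrapper and the names SAMPLE and promo_prices are assumptions, and the parent-level test is done in Python because ElementTree does not support the div[div/@class=...] predicate.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <root> element.
SAMPLE = """
<root>
  <div class="block"><div class="item_promo">item</div><div class="item_price">123</div></div>
  <div class="block"><div class="item_promo">item</div><div class="item_price">456</div></div>
  <div class="block"><div class="item_promo">item</div><div class="item_price">789</div></div>
  <div class="block"><div class="item">item</div><div class="item_price">222</div></div>
  <div class="block"><div class="item">item</div><div class="item_price">333</div></div>
</root>
"""


def promo_prices(root):
    """Collect item_price text from blocks that also hold an item_promo child."""
    prices = []
    for block in root.iter("div"):
        if block.get("class") != "block":
            continue
        # Keep the block only if it has an item_promo child, in any position.
        if block.find("div[@class='item_promo']") is None:
            continue
        price = block.find("div[@class='item_price']")
        if price is not None:
            prices.append(price.text)
    return prices


prices = promo_prices(ET.fromstring(SAMPLE))
print(prices)  # ['123', '456', '789']
```

Because the check is made on the parent block rather than on sibling order, it keeps working if the price div comes before the promo div.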
I have this HTML:
<article>
//some levels...
<div class="address item">
<a class="address">
Address 1
</a>
</div>
<div class="address item">
<a class="address">
Address 2
</a>
</div>
</article>
<article>
//some levels...
<div class="address item">
<a class="address">
Address 2
</a>
</div>
<div class="address item">
<a class="address">
Address 3
</a>
</div>
<div class="address item">
<a class="address">
Address 4
</a>
</div>
</article>
<article>
//some levels...
<div class="address item">
<a class="address">
Address 1
</a>
</div>
</article>
I need to find the article elements in which no a element contains the text Address 1 (only one article in this example). I use (//div[@class="address item"]/descendant::a[not(contains(text(), "Address 1"))])/ancestor::article but it still finds articles that contain Address 1.
Try this one and let me know if it still doesn't meet requirements:
//article[not(.//a[normalize-space(text())="Address 1"])]
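The key point is that the negation has to apply to the whole article ("no a element matches"), not to individual a elements. A rough standard-library sketch of that logic follows. The <root> wrapper, the id attributes, and the names SAMPLE, normalize_space and articles_without are all hypothetical additions used only to make the result easy to read.

```python
import xml.etree.ElementTree as ET

# Simplified version of the question's markup. The <root> wrapper and the
# id attributes are assumptions added for this illustration.
SAMPLE = """
<root>
  <article id="first">
    <div class="address item"><a class="address">
      Address 1
    </a></div>
    <div class="address item"><a class="address">
      Address 2
    </a></div>
  </article>
  <article id="second">
    <div class="address item"><a class="address">
      Address 2
    </a></div>
    <div class="address item"><a class="address">
      Address 3
    </a></div>
    <div class="address item"><a class="address">
      Address 4
    </a></div>
  </article>
  <article id="third">
    <div class="address item"><a class="address">
      Address 1
    </a></div>
  </article>
</root>
"""


def normalize_space(text):
    """Trim and collapse whitespace, like XPath's normalize-space()."""
    return " ".join((text or "").split())


def articles_without(root, label):
    """Return articles in which no <a> has exactly `label` as its text."""
    return [
        article
        for article in root.iter("article")
        if all(normalize_space(a.text) != label for a in article.iter("a"))
    ]


root = ET.fromstring(SAMPLE)
ids = [article.get("id") for article in articles_without(root, "Address 1")]
print(ids)  # ['second']
```

The original attempt selected every a element that is not "Address 1" and then walked up to its article, so any article with at least one other address still matched.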
I am creating rich snippets for my webshop. One of the itemtypes I use is the "Organization" type. The problem is that I have specified the Organisation name and the image in the header of my webshop and the address in the footer. In between is the rest of the webshop with all its products, reviews etc.
When I test my rich snippets with http://www.google.nl/webmasters/tools/richsnippets, I get two separate Organisations instead of one. Is there a way to combine my two scopes to become one Organisation?
Here is the situation I have right now:
<div id="header" itemscope itemtype="http://schema.org/Organization">
<h1 itemprop="name">Webshopname</h1>
<img id="logo" itemprop="logo" src="https://webshopurl/img/webshop-logo.png">
</div>
<div class="whole_article" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Articlename</h1>
</div>
<div id="footer" itemscope itemtype="http://schema.org/Organization">
<div id="address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<div itemprop="streetAddress">Address 12</div>
<div itemprop="postalCode">Postalcode</div>
<div itemprop="addressLocality">Locality</div>
</div>
</div>
Don’t create several items about the same thing on the same page.
You can use the itemref attribute to add properties to an item that are not nested in the same element:
<div id="header" itemscope itemtype="http://schema.org/Organization" itemref="address">
<h1 itemprop="name">Webshopname</h1>
<img id="logo" itemprop="logo" src="https://webshopurl/img/webshop-logo.png">
</div>
<div class="whole_article" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Articlename</h1>
</div>
<div id="footer">
<div id="address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<div itemprop="streetAddress">Address 12</div>
<div itemprop="postalCode">Postalcode</div>
<div itemprop="addressLocality">Locality</div>
</div>
</div>
I'm trying to scrape an HTML site with this structure:
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
I need to grab all of the p, h3 and ul tags between the two a[name] anchor elements.
Right now I successfully grabbed the first p:
a[name='how'] + div + p
but I'm not sure how to grab all of the elements between the two.
This is being used within the ScrAPI Ruby scraping library, which accepts all valid CSS selectors.
I don't believe this can be done in a single CSS selector, but would love to be proven wrong.
It can be done in a single XPath expression, however:
//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]
so if an alternate scraping library is an option, such as Mechanize (which uses Nokogiri, an XPath-compliant HTML parser), then it can be done using the XPath above.
EDIT: for completeness, here's a fully functioning script that demonstrates the XPath using the Nokogiri HTML parser.
require 'rubygems'
require 'nokogiri'
html =<<ENDOFHTML
<html>
<body>
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
</body>
</html>
ENDOFHTML
doc = Nokogiri::HTML.parse(html)
puts doc.xpath('//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]')
Result:
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
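For comparison, the same "everything between two anchors" selection can be sketched in Python with only the standard library. ElementTree has no sibling axes, so this slices the parent's child list instead. The names SAMPLE and siblings_between are assumptions for this illustration, and the sketch expects both anchors to be direct children of the same parent.

```python
import xml.etree.ElementTree as ET

# Same structure as the Nokogiri example, reduced to a well-formed <body>.
SAMPLE = """
<body>
  <a name="how"></a>
  <div class="ignore"></div>
  <p>...</p>
  <p>...</p>
  <p>...</p>
  <h3>...</h3>
  <p>...</p>
  <ul>...</ul>
  <p>...</p>
  <p>...</p>
  <p>...</p>
  <p>...</p>
  <a name="other-uses"></a>
</body>
"""


def siblings_between(parent, start_name, end_name):
    """Return the children of `parent` located between two named <a> anchors."""
    children = list(parent)
    # Record the name attribute of each <a> child, None for everything else.
    names = [c.get("name") if c.tag == "a" else None for c in children]
    start = names.index(start_name)        # raises ValueError if missing
    end = names.index(end_name, start + 1)  # end anchor must follow the start
    return children[start + 1:end]


body = ET.fromstring(SAMPLE)
tags = [el.tag for el in siblings_between(body, "how", "other-uses")]
print(tags)
```

As in the Nokogiri result, the div with class ignore is part of the selection, so it would need to be filtered out separately if only the p, h3 and ul elements are wanted.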