CSS Selector for group of elements? - ruby

I'm trying to scrape an HTML site with this structure:
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
I need to grab all of the p, h3 and ul tags between the two a[name] anchor elements.
Right now I successfully grabbed the first p:
a[name='how'] + div + p
but I'm not sure how to grab all of the elements between the two.
This is being used within the scrAPI Ruby scraping library, which accepts all valid CSS selectors.

I don't believe this can be done in a single CSS selector, but would love to be proven wrong.
It can be done in a single XPath expression, however:
//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]
so if an alternate scraping library is an option, such as Mechanize (which uses Nokogiri, an XPath-compliant HTML parser), then it can be done using the XPath above.
EDIT: for completeness, here's a fully functioning script that demonstrates the xpath using the Nokogiri HTML parser.
require 'rubygems'
require 'nokogiri'
html =<<ENDOFHTML
<html>
<body>
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
</body>
</html>
ENDOFHTML
doc = Nokogiri::HTML.parse(html)
puts doc.xpath('//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]')
Result:
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
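The same "between two anchors" idea can also be sketched without XPath. The snippet below is only an illustration: it swaps in Python's standard-library xml.etree.ElementTree for Nokogiri, and because that module has no sibling axes it walks the parent's children in order. The SAMPLE markup and the between_anchors helper are hypothetical names, and the input has to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the sample markup (ElementTree needs valid XML).
SAMPLE = """
<body>
  <a name="how"></a>
  <div class="ignore"></div>
  <p>one</p>
  <h3>heading</h3>
  <ul><li>item</li></ul>
  <p>two</p>
  <a name="other-uses"></a>
  <p>outside</p>
</body>
"""

def between_anchors(parent, start_name, end_name, wanted=("p", "h3", "ul")):
    """Return the wanted children of `parent` that sit between two <a name=...> anchors."""
    selected = []
    inside = False
    for child in parent:
        if child.tag == "a" and child.get("name") == start_name:
            inside = True   # start collecting after the opening anchor
            continue
        if child.tag == "a" and child.get("name") == end_name:
            break           # stop at the closing anchor
        if inside and child.tag in wanted:
            selected.append(child)
    return selected

body = ET.fromstring(SAMPLE)
tags = [el.tag for el in between_anchors(body, "how", "other-uses")]
print(tags)
```

Unlike the XPath result, this version also drops the div.ignore element, because only p, h3 and ul are listed in wanted.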

Related

Scrapy xpath select parent element based on text value in subelement and lacking of element

I want to select all elements article that don't contain a span element with class status and where the nested a element contains a href attribute which contains the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But reading here and trying to select the first parent element article like so returns "data=0"
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this.
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>
This XPath expression will do the work:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanations:
Translating your requirements into XPath syntax.
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
" and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
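To make the two predicates concrete, here is a rough plain-Python equivalent. It is a sketch under assumptions, not Scrapy code: it uses the standard-library xml.etree.ElementTree, whose XPath subset has neither not() nor contains(), so both conditions are checked in Python. The SAMPLE data is a shortened version of the markup from the question, and available_rentals is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the markup from the question.
SAMPLE = """
<root>
  <article class="car"><div><a href="/34625030/rent.html"></a></div></article>
  <article class="car"><div><a href="/12325230/buy.html"></a></div></article>
  <article class="car">
    <div><a href="/34632230/rent.html"></a></div>
    <span class="status">Rented</span>
  </article>
</root>
"""

def available_rentals(root):
    """Articles that have a rent.html link and no <span class="status"> descendant."""
    matches = []
    for article in root.iter("article"):
        # Mirrors [not(.//span[@class='status'])]
        has_status = article.find(".//span[@class='status']") is not None
        # Mirrors [.//a[contains(@href,'rent.html')]]
        has_rent_link = any("rent.html" in a.get("href", "") for a in article.iter("a"))
        if has_rent_link and not has_status:
            matches.append(article)
    return matches

root = ET.fromstring(SAMPLE)
hrefs = [article.find(".//a").get("href") for article in available_rentals(root)]
print(hrefs)
```

Only the first article survives: the second links to buy.html and the third carries a status span.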

Match tag inside tag using bash

I have this html
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14343208/">
<div class="article__content">
<h2 class="article__title t54 tm24">Person har falt ned bratt terreng - luftambulanse er på vei</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14341466/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
</div>
</a>
</article>
The thing is that I want to get only those HTML tags, in this case article tags, which have a child img tag inside them.
I have this sed command
sed -n '/<article class.*article--nyheter/,/<\/article>/p' onlyArticlesWithOutSpace.html > test.html
Now what I am trying to achieve is to get only those article tags which have an img tag inside them.
Output I want would be this
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14341466/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
</div>
</a>
</article>
I cannot use any xml/html parser. Just looking to use sed, grep, awk etc.
Careful: parsing XML with sed looks like a good idea, but it is not!
Thanks to Cyrus's comment for pointing to a good reference.
Anyway, you could try this:
sed -ne '/<article/{ :a; N; /<\/article/ ! ba ; /<img/p ; }'
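For readers who want to follow what the sed one-liner does, the same buffering logic is spelled out below in Python. This is only an illustration of the technique (collect lines from an opening article tag to the closing one, then keep the block if it contains an img tag). It is not a substitute for the sed/grep/awk answer the question asks for, and articles_with_img and the sample lines are hypothetical.

```python
def articles_with_img(lines):
    """Yield each <article>...</article> block that contains an <img tag."""
    block = None                      # None means we are outside an article
    for line in lines:
        if block is None:
            if "<article" in line:
                block = [line]        # like sed: start filling the pattern space
        else:
            block.append(line)        # like sed's N: append the next line
        if block is not None and "</article" in line:
            text = "\n".join(block)
            if "<img" in text:        # like sed's /<img/p
                yield text
            block = None

sample = [
    '<article class="a">',
    "<h2>no image</h2>",
    "</article>",
    '<article class="b">',
    '<img src="x.gif" />',
    "</article>",
]
found = list(articles_with_img(sample))
print(len(found))
```

Like the sed version, this is plain text matching, so it breaks on nested article tags or on tags split across lines in unexpected ways.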

Select first 4 children (same attributes) of the parent node

How do I select the first 4 children of a parent node when the parent has more children with the same tag and attributes as the ones I want to select?
I have tried this code, but it's not working:
//div[@class='content-page minified']/*[self::h2 or p[:2]]
My code:
<div class = "content-page minified">
<h2> Company Description </h2>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<h2> Mission Description</h2>
<p>...</p>
<ul>...</ul>
<p>...</p>
<h2>Requirements</h2>
<ul>...</ul>
<a class="my child class" href="#">...</a>
<div class="my second child class" href="#">...</div>
</div>
I expect to select both <h2> and first 3 <p> tags.
To get the first two <p> tags after the first <h2> tag, using lxml, try
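import lxml.html

html = """
<div class = "content-page minified">
<h2> Company Description </h2>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<h2> Mission Description</h2>
<p>...</p>
<ul>...</ul>
<p>...</p>
<h2>Requirements</h2>
<ul>...</ul>
<a class="my child class" href="#">...</a>
<div class="my second child class" href="#">...</div>
</div>
"""
# Parse the fragment into an element tree before querying it.
tree = lxml.html.fromstring(html)
# Take the first <h2> child of the div, then its first two following <p> siblings.
# Use position()<4 instead to get the first three.
h = tree.xpath("//div[@class='content-page minified']/h2[1]/following-sibling::p[position()<3]")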
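Reading the question as "the first h2 plus the three p tags that follow it", the selection can also be sketched with the standard library alone. The assumptions here are that xml.etree.ElementTree stands in for lxml, that the SAMPLE markup is a shortened well-formed copy of the question's HTML, and that heading_and_first_paragraphs is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the markup from the question.
SAMPLE = """
<div class="content-page minified">
  <h2>Company Description</h2>
  <p>a</p><p>b</p><p>c</p><p>d</p>
  <h2>Mission Description</h2>
  <p>e</p>
</div>
"""

def heading_and_first_paragraphs(div, limit=3):
    """Return the first <h2> child and up to `limit` <p> siblings that follow it."""
    selected = []
    paragraphs = 0
    for child in div:
        if child.tag == "h2":
            if selected:
                break                 # a second heading ends the section
            selected.append(child)
        elif child.tag == "p" and selected and paragraphs < limit:
            selected.append(child)
            paragraphs += 1
    return selected

div = ET.fromstring(SAMPLE)
texts = [el.text for el in heading_and_first_paragraphs(div)]
print(texts)
```

Stopping at the second h2 keeps paragraphs from the next section out of the result, even when the first section has fewer than three.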

Ruby + Nokogiri: Looping through array of divs and finding text within them

I have this HTML, notice everything is nested inside a .listing div:
<div id="listing_1085130_featured" class="item listing 1085130 even featured selected" data-blockindex="0" se:map:point="40.7219,-74.0034" se:map="map" se:behavior="selectable hoverable rememberable clickable mappable" style="cursor: pointer;">
<div class="item_inner ">
<div class="featured_tag hidden-xs">Featured Listing</div>
<div class="selected_marker hidden-xs hidden-sm">
<div id="results_list" class="photo">
<a href="/building/27-wooster/ph?featured=1">
<img border="0" src="https://s3.amazonaws.com/img.streeteasy.com/nyc/image/47/76017947.jpg" alt="27 Wooster Street #PH">
</a>
<div id="featured-tag-on-responsive" class="visible-xs">Featured Listing</div>
</div>
<div class="details">
<div class="details_title">
<h5>
<a se:clickable:target="true" href="/building/27-wooster/ph?featured=1">27 Wooster Street #PH</a>
</h5>
<div class="item_tools">
</div>
<div class="closer"></div>
<div class="details_info first_detail_info">
<div class="details_info">
<div class="details_info">
<div class="details_info">
</div>
<div class="closer"></div>
</div>
</div>
....
I have a bunch of these. How would I grab the href of the first link inside #results_list, which would be /building/27-wooster/ph?featured=1 in this case?
This is my method so far:
require 'json'
require 'open-uri'
require 'nokogiri'
def scrape(page_number)
  # URI.open is the open-uri entry point on current Ruby versions; query parameters are joined with &
  doc = Nokogiri::HTML(URI.open("http://streeteasy.com/for-sale/soho?page=#{page_number}&sort_by=price_desc"))
  doc.css(".listing").each do |listing|
    # grab data inside that specific listing
  end
end
Is there a way to look within just that listing? like listing.children("#results_list a").first.href
Well this worked for me:
doc.css("#results_list/a").each do |listing|
p listing['href']
end
To get just the first link, use at_css. Replacing the loop above with this one line returns the href of the first match:
doc.at_css("#results_list/a")['href']
Is there a way to look within just that listing?
Yes, but in HTML an id has to be unique within the page, so it's doubtful that each of your .listing divs contains a div with id="results_list". However, Nokogiri doesn't seem to have a problem with multiple identical ids:
require 'nokogiri'
html = <<'END_OF_HTML'
<div class="item listing 1085130 even featured selected">
<div>
<div id="results_list" class="photo">
<a>hello</a>
<a>apple</a>
</div>
</div>
</div>
<div class="item listing 1085131 even featured selected">
<div>
<div id="results_list" class="photo">
<a>world</a>
<a>cherry</a>
</div>
</div>
</div>
<div class="item listing 1085132 even featured selected">
<div>
<div id="results_list" class="photo">
<a>goodbye</a>
<a>peach</a>
</div>
</div>
</div>
END_OF_HTML
doc = Nokogiri::HTML(html)
doc.css(".listing").each do |div|
  # The leading .// keeps the search inside the current listing div
  a_tag = div.at_xpath('.//div[@id="results_list"]/a')
  puts a_tag.text
end
--output:--
hello
world
goodbye
at_xpath() searches for the first matching element.
.// searches within the current element
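The scoping rule (a path starting with .// is evaluated relative to the element it is called on) works the same way in other XPath-style APIs. Below is a small sketch with Python's standard-library xml.etree.ElementTree rather than Nokogiri. The SAMPLE markup and its href values are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listings; every listing repeats the same id, as in the question.
SAMPLE = """
<root>
  <div class="item listing">
    <div id="results_list" class="photo">
      <a href="/building/one">hello</a>
      <a href="/building/other">apple</a>
    </div>
  </div>
  <div class="item listing">
    <div id="results_list" class="photo">
      <a href="/building/two">world</a>
    </div>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)
first_links = []
for listing in root.findall("./div"):
    if "listing" not in listing.get("class", "").split():
        continue
    # find() returns the first match; .// limits the search to this listing
    link = listing.find(".//div[@id='results_list']/a")
    if link is not None:
        first_links.append((link.text, link.get("href")))
print(first_links)
```

Each iteration only sees the links of its own listing, so the repeated id does not cause matches to leak between listings.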

The HTML rel tag

I am trying to validate my HTML5 document with the W3C validator. I am using the fancybox jQuery plugin for simple light boxes and image galleries. In order to differentiate between the image galleries I am using the rel attribute.
When I validate my page I get the following error:
Bad value gallery for attribute rel on element a: Not an absolute IRI. The string gallery is not a registered keyword or absolute URL.
Here is my code:
<div class="portItem">
<div class="thumbs">
<div class="items">
<img src="images/rsl.jpg" class="col" alt="A website for R.S.Lynch and Company"/>
<div class="caption">
<a class="fancybox" rel="gallery1" href="images/rs1.jpg">R.S.Lynch & Company</a>
<div class="hidden">
<a class="fancybox" rel="gallery1" href="images/rs2.jpg"></a>
<a class="fancybox" rel="gallery1" href="images/rs3.jpg"></a>
<a class="fancybox" rel="gallery1" href="images/rs4.jpg"></a>
<a class="fancybox" rel="gallery1" href="images/rs5.jpg"></a>
</div>
</div>
</div>
</div>
</div>
Is there a better tag to use and get the same result? Thanks
I'd suggest using a data-* attribute here (data-gallery in this example), like this:
<a class="fancybox" data-gallery="1" href="images/rs1.jpg">R.S.Lynch & Company</a>
<a class="fancybox" data-gallery="1" href="images/rs2.jpg">...</a>
<a class="fancybox" data-gallery="1" href="images/rs3.jpg">...</a>
<a class="fancybox" data-gallery="1" href="images/rs4.jpg">...</a>
<a class="fancybox" data-gallery="1" href="images/rs5.jpg">...</a>
... as this type of attribute was designed specifically for attaching some data to DOM elements. It's more semantic than using class, in my opinion.
You can easily access these values with $('some selector').data('gallery') syntax.
As for the rel attribute, it looks like in HTML5 its value is restricted to a set of predefined keywords, and it is used to define higher-level relationships between documents.
You could add a class for each gallery instead: class="fancybox gallery1"
<a class="fancybox" data-gallery="1" href="images/img1.jpg">...</a>
<a class="fancybox" data-gallery="1" href="images/img2.jpg">...</a>
$(".fancybox").attr('rel', 'data-gallery').fancybox();
