I have this HTML; notice that everything is nested inside a .listing div:
<div id="listing_1085130_featured" class="item listing 1085130 even featured selected" data-blockindex="0" se:map:point="40.7219,-74.0034" se:map="map" se:behavior="selectable hoverable rememberable clickable mappable" style="cursor: pointer;">
<div class="item_inner ">
<div class="featured_tag hidden-xs">Featured Listing</div>
<div class="selected_marker hidden-xs hidden-sm">
<div id="results_list" class="photo">
<a href="/building/27-wooster/ph?featured=1">
<img border="0" src="https://s3.amazonaws.com/img.streeteasy.com/nyc/image/47/76017947.jpg" alt="27 Wooster Street #PH">
</a>
<div id="featured-tag-on-responsive" class="visible-xs">Featured Listing</div>
</div>
<div class="details">
<div class="details_title">
<h5>
<a se:clickable:target="true" href="/building/27-wooster/ph?featured=1">27 Wooster Street #PH</a>
</h5>
<div class="item_tools">
</div>
<div class="closer"></div>
<div class="details_info first_detail_info">
<div class="details_info">
<div class="details_info">
<div class="details_info">
</div>
<div class="closer"></div>
</div>
</div>
....
I have a bunch of these. How would I grab the href of the first link inside #results_list, which would be /building/27-wooster/ph?featured=1 in this case?
This is my method so far:
require 'json'
require 'open-uri'
require 'nokogiri'
def scrape(page_number)
  # Query parameters must be separated by "&"; URI.open is the open-uri entry point on current Rubies
  doc = Nokogiri::HTML(URI.open("http://streeteasy.com/for-sale/soho?page=#{page_number}&sort_by=price_desc"))
  doc.css(".listing").each do |listing|
    # grab data inside that specific listing
  end
end
Is there a way to look within just that listing? Something like listing.children("#results_list a").first.href
Well this worked for me:
doc.css("#results_list/a").each do |listing|
p listing['href']
end
To get just the first match, use at_css. Replacing the code above with this one line should produce the same result:
doc.at_css("#results_list/a")['href']
Is there a way to look within just that listing?
Yes, but in HTML an id has to be unique within the page, so it's doubtful that each of your .listing divs contains a div with id="results_list". However, Nokogiri doesn't seem to have a problem with multiple identical ids:
require 'nokogiri'
html = <<'END_OF_HTML'
<div class="item listing 1085130 even featured selected">
<div>
<div id="results_list" class="photo">
<a>hello</a>
<a>apple</a>
</div>
</div>
</div>
<div class="item listing 1085131 even featured selected">
<div>
<div id="results_list" class="photo">
<a>world</a>
<a>cherry</a>
</div>
</div>
</div>
<div class="item listing 1085132 even featured selected">
<div>
<div id="results_list" class="photo">
<a>goodbye</a>
<a>peach</a>
</div>
</div>
</div>
END_OF_HTML
doc = Nokogiri::HTML(html)
doc.css(".listing").each do |div|
a_tag = div.at_xpath('.//div[@id="results_list"]/a')
puts a_tag.text
end
--output:--
hello
world
goodbye
at_xpath() returns the first matching element.
.// searches within the current element; a bare // would search the whole document instead.
I am trying to extract all links that meet these three conditions:
Must be part of <div data-test="cond1">
Must have a <a href="..." class="cond2">
Must not have a <img src="..." class="cond3">
The result should be "/product/1234".
<div data-test="test1">
<div>
<div data-test="cond1">
<a href="/product/1234" class="cond2">Link 1</a>
<div class="test4">
<div class="test5">
<div class="test6">
<div class="test7">
<div class="test8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div data-test="test2">
<div>
<div data-test="cond1">
<a href="..." class="cond2">Link 2</a>
<div class="test4">
<div class="test5">
<div class="test6">
<div class="test7">
<div class="test8">
<img src="bild.jpg" class="cond3">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
I'm able to extract the links with the following xpath query.
//div[starts-with(@data-test,"cond")]/a[starts-with(@class,"cond")]/@href
(I know the first part is not really necessary. But better safe than sorry.)
But I'm still struggling with excluding the links that have a descendant img tag, and with how to add that to the query above.
This should do what you want:
//div[@data-test="cond1" and not(.//img[@class="cond3"])]
/a[@class="cond2"]
/@href
/product/1234
I am trying to scrape multiple elements with the same class names, but each has a different number of children. I am looking for a way to select specific elements using XPath (this would make it easiest for my loop).
const gameTimeElement = await page.$$('//*[@id="section-content"]/div[2]/div[1]/div/div['+ i + ']');
const gameTimeString = await gameTimeElement[j].$eval('h3', (h3) => h3.innerHTML);
This currently does not work.
After I select the element, I grab the h3 tag inside and evaluate it to get the innerHTML.
Is there a way to do this utilizing xpath?
<div id="section-content" style="display: block;">
</div>
<div class="matches">
<div class="day day-28-1" data-week="1" style="display: block;">
<h4>Sat, March 28, 2020</h4>
<div class="day-wrap">
<div class="match region-7-57d5ab4-9qs98v" data-week="1">
<h3 class="time">2:00PM
<span>(Central Daylight Time)</span>
<span class="fr">Best of 7</span>
</h3>
<div class="row ac ">
<div class="col-xs-3 ar">
<img class="team-logo" src="url"></div>
<div class="col-xs-2 al">
<h4 class="loss">(NA)<br>
<span class="team-name">Team1</span>
<br>
<span class="win spoiler-wrap">0</span>
</h4>
</div>
<div class="col-xs-2">
<img class="league-logo" src="url">
<h4> V.S.</h4>
</div>
<div class="col-xs-2 ar">
<h4 class="">(NA)<br>
<span class="team-name">Team2</span>
<br>
<span class="win spoiler-wrap">4</span>
</h4>
</div>
This is a sample of what I am working with for HTML on the website.
Yes, div class="day-wrap" could have a different number of children. But I don't think that's a problem.
You want to get the game times of all Rocket League matches. As you've noticed, game times are located within h3 elements. You can access them directly with one of the following XPaths:
//div[@id="section-content"]//h3
//div[@class="day-wrap"]//h3
//div[contains(@class,"match region")]//h3
If you want something for a loop, then you can try:
(//div[@class="day-wrap"]//h3)[i]
where i is the number to increment (from 1 to x).
Side notes: your sample data looks incorrect (according to your XPath). There is a closing div on line 2, and it seems you omitted div class="row middle-xs center-xs weeks" before div class="matches".
The question is simple but I don't have enough practice for this case :)
How can I get the price text from every div within a "block", given that we only want the blocks that contain an item_promo element?
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">123</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">456</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">789</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">222</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">333</div>
</div>
You could use this XPath:
//div[@class='block']/*[@class='item_promo']/following-sibling::div[@class='item_price']/text()
It looks for elements whose class attribute is item_promo, then takes the following sibling div whose class is item_price and grabs its text.
This XPath,
//div[div/@class='item_promo']/div[@class='item_price']
will return those item_price class div elements with sibling item_promo class div elements:
<div class="item_price">123</div>
<div class="item_price">456</div>
<div class="item_price">789</div>
This will work regardless of label/price order.
Let's say I have this html that has various depths of descendants and a mixture of element types:
<div class="foo">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="foo">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>
And I want to apply a class of bex to all the foos that contain classes of bar so it looks like:
<div class="bex">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="bex">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>
How would I do that with Ruby/Nokogiri? I've tried all sorts of things and can't quite get it. Thanks.
Edit: closed the duh, oops.
I spent a long time wondering why the second foo wasn't found.
Your data is broken, "duh isn't closed.
To select the nodes, you can use:
doc.xpath("//div[@class='foo' and .//div[@class='bar']]")
As an example:
data = %q(<div class="foo">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="foo">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>)
require 'nokogiri'
doc = Nokogiri.HTML(data)
# Replace the class on every foo div that has a bar div somewhere below it
doc.xpath("//div[@class='foo' and .//div[@class='bar']]").each do |node|
  node["class"] = 'bex'
end
puts doc
So, I am building a web crawler for one site's comment section, and I have run into a problem: I can't seem to find the text node for a comment's content. This is how the web page's element looks:
<div class="comments"> // this is the whole comments section
<div class="comment"> // this is where the p is located
<div class="comment-top">
<div class="comment-nr">208. PROTAS</div>
<div class="comment-info">
<div class="comment-time">2015-06-30 13:00</div>
<div class="comment-ip">IP: 178.250.32.165</div>
<div class="comment-vert1">
<a href="javascript:comr(24470645,'p')">
<img src="http://img.lrytas.lt/css2/img/com-good.jpg" alt="">
</a> <span id="cy_24470645"> </span>
</div>
<div class="comment-vert2">
<a href="javascript:comr(24470645,'m')">
<img src="http://img.lrytas.lt/css2/img/com-bad.jpg" alt="">
</a> <span id="cn_24470645"> </span>
</div>
</div>
</div>
<p class="text-13 no-intend">Test text</p> // I need to get this comments content
</div>
I tried a lot of XPaths, like:
*/div[contains(@class, "comment")]/p/text()
/p[contains(@class, "text-13 no-intend")]/text()
etc.
But I can't seem to locate it.
I would appreciate any help.
How about this:
//div[@class = 'comments']/div[@class = 'comment'][1]/p/text()