Non-greedy XPATH to get HTML before the nearest h2 node - xpath

Is it possible to scrape XPATH non-greedy-ly? I mean for example I have this HTML:
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
I want an XPATH which only gets the paragraphs with A and B inside. The text inside the nearest h2 node is always changing, so I need non-greedy XPATH if it is possible. Is it possible? And how?

Assuming the text of <h2>Only until this node</h2> is dynamic, you can select the first h2 by position instead of by text and take the p siblings that precede it:
//div/h2[1]/preceding-sibling::p
var htmlString = `
<body>
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
<div>
<p>A1</p>
<p>B2</p>
<p>C3</p>
<h2>Second Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
</body>`;
var doc = new DOMParser().parseFromString(htmlString, 'text/xml');
var iterator = doc.evaluate('//div/h2[1]/preceding-sibling::p', doc, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
var thisNode = iterator.iterateNext();
while (thisNode) {
console.log(thisNode.outerHTML);
thisNode = iterator.iterateNext();
}
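The same "stop at the first h2" idea can also be sketched without XPath axes. This is only an illustration in Python: the standard library's ElementTree does not support preceding-sibling, so the sketch walks the children of the div and stops at the first h2. The function name paragraphs_before_first_h2 is made up for this example, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

HTML = """
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
"""

def paragraphs_before_first_h2(div):
    """Return the text of each <p> child that comes before the first <h2> child."""
    # Iterating an Element yields its direct children in document order;
    # takewhile stops as soon as the first <h2> is reached.
    before_h2 = takewhile(lambda child: child.tag != "h2", div)
    return [child.text for child in before_h2 if child.tag == "p"]

div = ET.fromstring(HTML)
result = paragraphs_before_first_h2(div)
print(result)
```

Because nothing here looks at the h2 text, the result does not change when that text does.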

Try this xpath
//div/p[following::h2[contains(.,'Only until this node')]]
to get the p elements that come before the h2 element containing the text Only until this node.
Check out the example below:
from scrapy import Selector
htmldoc="""
<div>
<p>A</p>
<p>B</p>
<p>C</p>
<p>D</p>
<h2>Only until this node</h2>
<p>E</p>
<p>F</p>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
<p>I should not even this</p>
</div>
"""
sel = Selector(text=htmldoc)
for item in sel.xpath("//div/p[following::h2[contains(.,'Only until this node')]]/text()").extract():
    print(item)
What it produces:
A
B
C
D

You can try the following XPath-1.0 expression:
/div/p[following-sibling::*[self::h2='Only until this node']]
It gets all p elements that have a following h2 sibling whose string value is "Only until this node".

Related

HtmlAgilityPack - SelectSingleNode for descendants

I found that HtmlAgilityPack SelectSingleNode always starts from the first node of the original DOM. Is there an equivalent method to set its starting node ?
Sample html
<html>
<body>
<a href="home.com">Home</a>
<div id="contentDiv">
<tr class="blueRow">
<td scope="row"><a href="iwantthis.com">target</a></td>
</tr>
</div>
</body>
</html>
Not working code
//Expected:iwantthis.com Actual:home.com,
string url = contentDiv.SelectSingleNode("//tr[@class='blueRow']")
.SelectSingleNode("//a") //What should this be ?
.GetAttributeValue("href", "");
I have to replace the code above with this:
var tds = contentDiv.SelectSingleNode("//tr[@class='blueRow']").Descendants("td");
string url = "";
foreach (HtmlNode td in tds)
{
if (td.Descendants("a").Any())
{
url= td.ChildNodes.First().GetAttributeValue("href", "");
}
}
I am using HtmlAgilityPack 1.7.4 on .Net Framework 4.6.2
The XPath you are using always starts at the root of the document. SelectSingleNode("//a") means start at the root of the document and find the first a anywhere in the document; that's why it grabs the Home link.
If you want to start from the current node, you should use the . selector. SelectSingleNode(".//a") would mean find the first a that is anywhere beneath the current node.
So your code would look like this:
string url = contentDiv.SelectSingleNode(".//tr[@class='blueRow']")
.SelectSingleNode(".//a")
.GetAttributeValue("href", "");
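The difference between searching from the document root and searching from the current node can be illustrated outside HtmlAgilityPack as well. The sketch below is a Python analogue using the standard library's ElementTree, not HAP itself. The two anchors and their href values (home.com, iwantthis.com) are assumed from the expected/actual comment in the question.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
<body>
<a href="home.com">Home</a>
<div id="contentDiv">
<tr class="blueRow">
<td scope="row"><a href="iwantthis.com">target</a></td>
</tr>
</div>
</body>
</html>
"""

root = ET.fromstring(PAGE)

# Searching from the document root finds the first <a> anywhere in the page.
first_in_document = root.find(".//a").get("href")

# Searching from the row with a leading "." only looks beneath that row.
row = root.find(".//tr[@class='blueRow']")
first_in_row = row.find(".//a").get("href")

print(first_in_document)
print(first_in_row)
```

In ElementTree a path is always evaluated relative to the element it is called on, which is the behaviour the leading . asks for in HAP.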

Scraping the href value of anchor in Ruby

I'm working on a project where I have to scrape a "website," which is just an HTML file in one of the local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I am also scraping for other things, so ignore the rest. Here is what I have so far:
def self.scrape_index_page(index_url) #responsible for scraping the index page that lists all of the students
#return an array of hashes in which each hash represents one student.
html = index_url
doc = Nokogiri::HTML(open(html))
# doc.css(".student-name").first.text
# doc.css(".student-location").first.text
#student_card = doc.css(".student-card").first
#student_card.css("a").text
end
Here is one of the student profiles. They are all the same, so I'm just interested in scraping the href url value.
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div">
<h3 class="view-profile-text">View Profile</h3>
</div>
<div class="card-text-container">
<h4 class="student-name">Eric Chu</h4>
<p class="student-location">Glenelg, MD</p>
</div>
</a>
</div>
thanks for your help!
Once you get an anchor tag in Nokogiri, you can get the href like this:
anchor["href"]
So in your example, you could get the href by doing the following:
student_card = doc.css(".student-card").first
href = student_card.css("a").first["href"]
If you wanted to collect all of the href values at once, you could do something like this:
hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }
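For comparison, here is a rough Python sketch of the same "every a inside a .student-card" lookup using only the standard library's html.parser. It is not Nokogiri; the class name StudentLinkParser and its div_depth counter are invented for this example, and it assumes that div tags are properly closed.

```python
from html.parser import HTMLParser

CARD = """
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div">
<h3 class="view-profile-text">View Profile</h3>
</div>
<div class="card-text-container">
<h4 class="student-name">Eric Chu</h4>
<p class="student-location">Glenelg, MD</p>
</div>
</a>
</div>
"""

class StudentLinkParser(HTMLParser):
    """Collect the href of every <a> nested inside a div with class student-card."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # how many divs deep we are inside a card; 0 means outside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at a card, and keep counting nested divs inside it.
            if self.div_depth or "student-card" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1

parser = StudentLinkParser()
parser.feed(CARD)
print(parser.hrefs)
```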

Selenium webdriver: How to find nested tags?

A webpage contains
<div class="divclass">
<ul>
<li>
"hello world 1"
<img src="abc1.jpg">
</li>
<li>
"hello world 2"
<img src="abc2.jpg">
</li>
</ul>
</div>
I am able to get data under div using
element = driver.find_element(class: "divclass")
element.text.split("\n")
But I want all links respective to the achieved data
I tried using
driver.find_elements(:css, "div.divclass a").map(&:text)
but failed.
How can I get related links to the data?
If you want to get the href attribute, try the code below (I am not familiar with Ruby, so I am posting the code in Java).
List<WebElement> elements = driver.findElements(By.xpath("//*[@class='divclass']//a"));
for(WebElement webElement:elements){
System.out.println(webElement.getAttribute("href"));
}
The XPath selects every a element under the element whose class is divclass.
If you want the text of all the links instead, you can use the code below:
List<WebElement> elements = driver.findElements(By.xpath("//*[@class='divclass']//a"));
for(WebElement webElement:elements){
System.out.println(webElement.getText());
}
Hope it helps.
In Ruby:
elements = driver.find_elements(:xpath, "//*[@class='divclass']//a")
links = elements.map { |e| { e.text => e.attribute("href") } }
This returns an array of hashes, each mapping a link's text to its href.

html-agility-pack extract a background image

How do I extract the url from the following HTML.
i.e.. extract:
http://media.somesite.com.au/img-101x76.jpg
from:
<div class="media-img">
<div class=" searched-img" style="background-image: url(http://media.somesite.com.au/img-101x76.jpg);"></div>
</div>
In XPath 1.0 in general, you can combine the substring-after() and substring-before() functions to extract part of a text. But HAP's SelectNodes() and SelectSingleNode() can only return nodes, so those XPath functions won't help here.
One possible approach is to get the entire value of style attribute using XPath & HAP, then process the value further from .NET, using regex for example :
var html = @"<div class='media-img'>
<div class=' searched-img' style='background-image: url(http://media.somesite.com.au/img-101x76.jpg);'></div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'searched-img')]");
var url = Regex.Match(div.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
Console.WriteLine(url);
output :
http://media.somesite.com.au/img-101x76.jpg
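The regex step on its own can be sketched in Python as follows. This is only an illustration of the "take the style value, then pull out what is inside url(...)" idea; the helper name extract_background_url is hypothetical. Unlike the greedy (.*) above, the pattern here is lazy and tolerates optional quotes, on the assumption that a style value may hold more than one declaration containing a closing parenthesis.

```python
import re

# Capture whatever sits inside url(...), ignoring optional quotes and spaces.
URL_IN_STYLE = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")

def extract_background_url(style):
    """Return the first url(...) target in a CSS style string, or None if absent."""
    match = URL_IN_STYLE.search(style)
    return match.group(1) if match else None

style = "background-image: url(http://media.somesite.com.au/img-101x76.jpg);"
url = extract_background_url(style)
print(url)
```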

Rivetsjs iteration - by using an integer instead of collection

According to the rivetsjs docs, we can render content by iterating over an object (array) like this:
<ul>
<li rv-each-todo="list.todos">
<input type="checkbox" rv-checked="todo.done">
<span>{ todo.summary }</span>
</li>
</ul>
But is there a way to iterate a fixed number of times, using a single integer to indicate how many iterations should take place?
I mean something like this,
<li rv-each="list.num_of_todos">
...
where num_of_todos is an integer to indicate number of iterations to take place.
There is no "proper" way of doing it. However, you can easily mimic this using a formatter that returns an array as shown below:
var list = {
name: 'to do list',
noOfToDos: 5
};
rivets.formatters.makearray = function(value) {
var result = [];
while (value--) {
result.push(0)
};
return result;
}
rivets.bind($("ul"), { // bind rivets
list: list
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/rivets/0.7.1/rivets.bundled.min.js"></script>
<ul>
<li rv-each-item="list.noOfToDos | makearray">Test</li>
</ul>

Resources