XPath expression to get the a href description for a given class name

In this div:
<div class="black">
<a href="...">Vitamin</a>
<a href="...">Watergate</a>
</div>
I need an XPath expression that returns just the link text of the a href tags, in this example "Vitamin" and "Watergate".

//div[@class='black']/a[1]/text()
will return Vitamin and
//div[@class='black']/a[2]/text()
will return Watergate (XPath positions start at 1, not 0).
Regards
You better google that first.
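For the question above, a minimal sketch in Python using only the standard library's xml.etree.ElementTree, which understands a small XPath subset. It assumes the fragment is well-formed; the href values and the wrapping body element are hypothetical placeholders, since the original markup does not show them. ElementTree has no text() step, so the link text is read from the .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Sample markup; the href values and the <body> wrapper are hypothetical.
HTML = """
<body>
  <div class="black">
    <a href="/vitamin">Vitamin</a>
    <a href="/watergate">Watergate</a>
  </div>
</body>
"""

root = ET.fromstring(HTML.strip())

# XPath positions are 1-based: a[1] is the first link, a[2] the second.
first = root.find(".//div[@class='black']/a[1]")
second = root.find(".//div[@class='black']/a[2]")

# Without a position predicate, every matching link is returned in order.
link_texts = [a.text for a in root.findall(".//div[@class='black']/a")]

print(first.text, second.text)
print(link_texts)
```

Real-world HTML is often not well-formed XML, so a lenient HTML parser would probably be needed in practice; the XPath itself stays the same.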

Related

Scraping the href value of anchor in Ruby

I'm working on a project where I have to scrape a "website," which is just an HTML file in one of the local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I am also scraping for other things, so ignore the rest. Here is what I have so far:
def self.scrape_index_page(index_url) # responsible for scraping the index page that lists all of the students
  # return an array of hashes in which each hash represents one student.
  html = index_url
  doc = Nokogiri::HTML(open(html))
  # doc.css(".student-name").first.text
  # doc.css(".student-location").first.text
  # student_card = doc.css(".student-card").first
  # student_card.css("a").text
end
Here is one of the student profiles. They are all the same, so I'm just interested in scraping the href url value.
<div class="student-card" id="eric-chu-card">
  <a href="students/eric-chu.html">
    <div class="view-profile-div">
      <h3 class="view-profile-text">View Profile</h3>
    </div>
    <div class="card-text-container">
      <h4 class="student-name">Eric Chu</h4>
      <p class="student-location">Glenelg, MD</p>
    </div>
  </a>
</div>
thanks for your help!
Once you get an anchor tag in Nokogiri, you can get the href like this:
anchor["href"]
So in your example, you could get the href by doing the following:
student_card = doc.css(".student-card").first
href = student_card.css("a").first["href"]
If you wanted to collect all of the href values at once, you could do something like this:
hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }

Xpath to parse the background image url from a style tag

<div class="card-image" style="background-image: url("https://cdn6.bigcommerce.com/s-0kvv9/images/stencil/500x659/products/170691/242554/dicemastgreenflash__36803.1503934716.jpg?c=2");">
</div>
When I input this,
.//*[@class='card-image']/@style
it returns this:
background-image:url('https://cdn6.bigcommerce.com/s-0kvv9/images/stencil/500x659/products/170691/242554/dicemastgreenflash__36803.1503934716.jpg?c=2');
I only want it to return the URL.
This XPath,
substring-before(substring-after(.//*[@class='card-image']/@style, "url('"), "')")
will select only the URL, as requested.
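The substring-after / substring-before idea can also be sketched in plain Python, which may help when the XPath engine at hand lacks string functions (the standard library's ElementTree, for instance, does not support them). The function name extract_css_url and the example.com URL are hypothetical; the closing delimiter is taken to be ') so that the trailing quote is not part of the result.

```python
def extract_css_url(style: str) -> str:
    """Return the text between url(' and ') in a style value, or '' if absent."""
    # Counterpart of substring-after(style, "url('")
    _, found, rest = style.partition("url('")
    if not found:
        return ""
    # Counterpart of substring-before(rest, "')")
    url, found, _ = rest.partition("')")
    return url if found else ""


# Hypothetical style value in the same shape as the one returned above.
style = "background-image:url('https://example.com/image.jpg?c=2');"
print(extract_css_url(style))
```

Like the XPath version, this returns an empty string when either delimiter is missing, and it assumes the URL is wrapped in single quotes.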

How do I extract a class from this HTML with XPath

<div class="a-row a-spacing-micro" style="">
<i class="a-icon a-icon-star-medium a-star-medium-4"></i>
<a data-analytics="{"name":"Review.FullReview"}" class="a-size-base a-link-normal a-color-base review-title a-text-bold" href="/gp/cdp/member-reviews/A19123D9G66E0O/ref=pdp_new_read_full_review_link?ie=UTF8&page=1&sort_by=MostRecentReview#R1Z0A6K9CROFFV"> <span>Good Cheap Knee Pads</span>
</a>
</div>
I have this HTML that I am scraping with XPath. What XPath would I use to just return the class "a-star-medium-4"?
Thanks!
Jeff
If it's only for this specific HTML, you can extract the class name starting with a-star with this XPath:
substring(string(//i/@class),string-length(substring-before(string(//i/@class),'a-star')) +1)
When applied to your example HTML this returns a-star-medium-4.
As explanation: string(//i/@class) returns the class attribute value a-icon a-icon-star-medium a-star-medium-4. To get only the class name starting with a-star, substring-before() first yields the part of the string that precedes a-star. The string-length() of that prefix, plus 1, is the position where a-star begins, and substring() returns everything from that position to the end of the string.
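The same arithmetic can be followed step by step in Python. This is only a sketch: from_prefix mirrors the XPath expression, while class_token is an alternative that splits the attribute into class names first. Both function names are hypothetical. The second variant does not depend on the a-star class being the last one in the attribute, which is the limitation hinted at by "only for this specific HTML".

```python
def from_prefix(class_attr: str, prefix: str) -> str:
    """Mirror substring(s, string-length(substring-before(s, prefix)) + 1)."""
    # substring-before() yields "" when the prefix does not occur at all.
    before = class_attr.partition(prefix)[0] if prefix in class_attr else ""
    # XPath positions are 1-based, Python slices 0-based, hence no "+ 1" here.
    return class_attr[len(before):]


def class_token(class_attr: str, prefix: str):
    """Return the first whitespace-separated class name starting with prefix."""
    for token in class_attr.split():
        if token.startswith(prefix):
            return token
    return None


classes = "a-icon a-icon-star-medium a-star-medium-4"
print(from_prefix(classes, "a-star"))
print(class_token(classes, "a-star"))

# If the class order changes, the substring approach keeps the trailing classes.
reordered = "a-star-medium-4 a-icon"
print(from_prefix(reordered, "a-star"))
print(class_token(reordered, "a-star"))
```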

Select href with id and class using xpath

Let's say I have a DOM like this:
<div id="tabsmenu">
<ul>
<li class="one"><a href="#foo" class="ui-tabs-anchor">foo</a></li>
<li class="two"><a href="#baz" class="ui-tabs-anchor">baz </a></li>
</ul>
</div>
and I would like to get the href values of the <a> elements:
# desired output: ['#foo', '#baz']
How can I do this with XPath, combining the id of the container with a specific class on the elements inside it?
Already tried:
some_doc.xpath('//a[@id="tabsmenu"]/[@class="ui-tabs-anchor"]/@href')
# select all href tags of any a element that is in id tabsmenu and class attribute ui-tabs-anchor
EDIT - corrected tabmenu into tabsmenu
You're most likely looking for something like this:
//div[@id='tabsmenu']//a[@class='ui-tabs-anchor']/@href
That will get all href attributes that are part of an a tag with the class ui-tabs-anchor and inside a div element with the id tabsmenu.
Also you might want to take a look at this question:
Find out if class name contains certain text
This matters because @class='ui-tabs-anchor' compares against the whole attribute value: if an additional class is ever added, for example class="ui-tabs-anchor disabled", the expression no longer matches.
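The difference between an exact class comparison and a per-class check can be sketched with Python's standard library. The markup below is an assumption: the a elements and the extra "disabled" class are hypothetical, added only to show the failure case. In full XPath 1.0 the usual idiom is contains(concat(' ', normalize-space(@class), ' '), ' ui-tabs-anchor '); ElementTree does not support those functions, so the class check is done in Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second link carries an extra "disabled" class.
HTML = """
<body>
  <div id="tabsmenu">
    <ul>
      <li class="one"><a href="#foo" class="ui-tabs-anchor">foo</a></li>
      <li class="two"><a href="#baz" class="ui-tabs-anchor disabled">baz</a></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(HTML.strip())

# Exact comparison on the whole attribute value: misses the second link.
exact = [
    a.get("href")
    for a in root.findall(".//div[@id='tabsmenu']//a[@class='ui-tabs-anchor']")
]

# Per-class check: split the attribute on whitespace, then test membership.
by_token = [
    a.get("href")
    for a in root.findall(".//div[@id='tabsmenu']//a")
    if "ui-tabs-anchor" in a.get("class", "").split()
]

print(exact)     # ['#foo']
print(by_token)  # ['#foo', '#baz']
```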

HtmlAgilityPack Div Class Contains String

I'm trying to scrape only article text from web pages. I have discovered that the article is always surrounded with div tags. Unfortunately the class of these div tags is slightly different for each web page. I looked into using XPath but I don't think it will work due to the different class names. Is there a way I can get all the div tags and then get the class?
Examples
<div class="entry_single">
<p>I recently traveled without my notebook for the first time in ages.</p>
</div>
<div class="entry-content-pagination">
<p>Ward 9 Ald. Steven Dove</p>
</div>
That'd be easier using Linq.
foreach (HtmlNode div in doc.DocumentNode.Descendants("div"))
{
    string className = div.GetAttributeValue("class", string.Empty);
    // do something with class name
}
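The "do something with class name" step could be a substring test on the class value. Below is a rough sketch of that idea in Python rather than C#, using only the standard library. It assumes, based solely on the two examples in the question, that article containers have "entry" somewhere in their class name; the sidebar div and the body wrapper are hypothetical additions to show a non-match.

```python
import xml.etree.ElementTree as ET

# The first two divs come from the question; the third is a hypothetical non-match.
HTML = """
<body>
  <div class="entry_single"><p>I recently traveled without my notebook for the first time in ages.</p></div>
  <div class="entry-content-pagination"><p>Ward 9 Ald. Steven Dove</p></div>
  <div class="sidebar"><p>Unrelated links</p></div>
</body>
"""

root = ET.fromstring(HTML.strip())

# Walk every div, read its class attribute, and keep those containing "entry".
article_divs = [
    div for div in root.iter("div")
    if "entry" in div.get("class", "")
]

# Collect the text inside each matching div, including nested elements.
article_text = ["".join(div.itertext()).strip() for div in article_divs]

print([div.get("class") for div in article_divs])
print(article_text)
```

A substring test like this is a heuristic: it would also match unrelated classes that happen to contain "entry", so the marker string probably needs tuning per site.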
