I'm working on a project where I have to scrape a "website," which is just an HTML file in one of the local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I am also scraping for other things, so ignore the rest. Here is what I have so far:
def self.scrape_index_page(index_url) #responsible for scraping the index page that lists all of the students
#return an array of hashes in which each hash represents one student.
html = index_url
doc = Nokogiri::HTML(open(html))
# doc.css(".student-name").first.text
# doc.css(".student-location").first.text
#student_card = doc.css(".student-card").first
#student_card.css("a").text
end
Here is one of the student profiles. They are all the same, so I'm just interested in scraping the href url value.
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div">
<h3 class="view-profile-text">View Profile</h3>
</div>
<div class="card-text-container">
<h4 class="student-name">Eric Chu</h4>
<p class="student-location">Glenelg, MD</p>
</div>
</a>
</div>
thanks for your help!
Once you get an anchor tag in Nokogiri, you can get the href like this:
anchor["href"]
So in your example, you could get the href by doing the following:
student_card = doc.css(".student-card").first
href = student_card.css("a").first["href"]
If you wanted to collect all of the href values at once, you could do something like this:
hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }
I'm trying to scrape the app URLs from a directory that's laid out in a grid:
<div id="mas-apps-list-tile-grid" class="mas-app-list">
<div class="solution-tile-container">
<div class="solution-tile-content-container">
<a href="url.com/app/345">
<div class="solution-tile-container">
<div class="solution-tile-content-container">
<a href="url.com/app/567">
... and so on
Here are my 2 lines of Watir code that are supposed to create an array with all URLs from a page:
company_listings = browser.div(id: 'mas-apps-list-tile-grid')
companies = company_listings.map { |div| div.a.href }
But instead of an array with URLs, 'companies' returns:
#<Watir::Map: located: false; {:id=>"mas-apps-list-tile-grid", :tag_name=>"div"} --> {:tag_name=>"map"}>
What am I doing wrong?
The #map method for a Watir::Element (or specifically Watir::Div in this case) returns a Watir::Map element. This is used for locating <map> tags/elements on the page.
In contrast, the #map method for a Watir::ElementCollection will iterate over each of the matching elements. This is what is missing.
You have a couple of options. If you want all the links in the grid, the most straightforward approach is to create a #links or #as element collection:
company_grid = browser.div(id: 'mas-apps-list-tile-grid')
company_hrefs = company_grid.links.map { |a| a.href }
If there are only some links you care about, you'll need to use the link's parents to narrow it down. For example, maybe it's just links located in a "solution-tile-content-container" div:
company_grid = browser.div(id: 'mas-apps-list-tile-grid')
company_listings = company_grid.divs(class: 'solution-tile-content-container')
company_hrefs = company_listings.map { |div| div.a.href }
I'm trying to fetch a span of 4,600 elements
<span> 4,600 </span>
I inspected the elements and found that each one is a list item with a child element whose title and href I want to fetch. The problems are:
not all elements are visible; you have to scroll down to find more of them
I can't seem to fetch even a single piece of data successfully
puts browser.th(:class => %w("_9irns _pg23k _jpwof _gvoze")).link.hreflang
This is the structure of the markup I'm trying to fetch:
<ul class = 'xxx'>
<div class = 'xxa'>
<li class ='fff'>
<li class ='fff'>
<li class ='fff'>
.
.
Each <li class='fff'> contains an <a class='xxx xxx xxx xxx'> holding the data I'm trying to fetch: the title and the href.
To be clearer: how can I iterate over all the elements with class 'fff' and pick the URL from a child element of each one?
Don't use quotes inside the %w when locating an element by a collection of classes. To read the text of hidden elements, try requiring the watigiri gem and using #text!.
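To see why the quotes matter, here is a small plain-Ruby sketch that needs no browser. It only shows what %w produces for the class list from the question; the variable names are illustrative.

```ruby
# Inside %w, quote characters are literal, so the first and last
# class names from the question pick up a stray double quote.
quoted = %w("_9irns _pg23k _jpwof _gvoze")
plain  = %w(_9irns _pg23k _jpwof _gvoze)

p quoted  # first and last entries carry a literal "
p plain   # four clean class names
```

With the quoted form, the locator is asked for classes that cannot exist on the page, which is consistent with nothing being found.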
<div class="card-image" style="background-image: url('https://cdn6.bigcommerce.com/s-0kvv9/images/stencil/500x659/products/170691/242554/dicemastgreenflash__36803.1503934716.jpg?c=2');">
</div>
When I input this,
.//*[@class='card-image']/@style
it returns this:
background-image:url('https://cdn6.bigcommerce.com/s-0kvv9/images/stencil/500x659/products/170691/242554/dicemastgreenflash__36803.1503934716.jpg?c=2');
I only want it to return the URL.
This XPath,
substring-before(substring-after(.//*[@class='card-image']/@style, "url('"), "')")
will select only the URL, as requested.
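If the XPath engine in use does not evaluate string functions, one alternative is to fetch the whole style value and trim it in Ruby. This is a minimal sketch of that swapped-in approach using a regular expression; the helper name extract_css_url is hypothetical, and the sample string is the style value returned in the question.

```ruby
# Hypothetical helper: pull the first url(...) target out of a CSS
# declaration, tolerating single quotes, double quotes, or no quotes.
def extract_css_url(style)
  style[/url\(\s*['"]?(.*?)['"]?\s*\)/, 1]
end

# Style value as returned by the attribute query in the question.
style = "background-image:url('https://cdn6.bigcommerce.com/s-0kvv9/images/stencil/500x659/products/170691/242554/dicemastgreenflash__36803.1503934716.jpg?c=2');"

puts extract_css_url(style)
```

The helper returns nil when the string contains no url(...) part, so callers may want to check for that.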
A webpage contains
<div class="divclass">
<ul>
<li>
"hello world 1"
<img src="abc1.jpg">
</li>
<li>
"hello world 2"
<img src="abc2.jpg">
</li>
</ul>
</div>
I am able to get data under div using
element = driver.find_element(class: "divclass")
element.text.split("\n")
But I want the links that correspond to the retrieved data.
I tried using
driver.find_elements(:css, "div.divclass a").map(&:text)
but it failed.
How can I get the links related to the data?
If you want to get the href attribute, try the code below (I am not familiar with Ruby, so I am posting the code in Java).
List<WebElement> elements = driver.findElements(By.xpath("//*[@class='divclass']//a"));
for(WebElement webElement:elements){
System.out.println(webElement.getAttribute("href"));
}
The XPath matches all the a tags under the element whose class is divclass.
If you want to get the text of all the links, you can use the code below:
List<WebElement> elements = driver.findElements(By.xpath("//*[@class='divclass']//a"));
for(WebElement webElement:elements){
System.out.println(webElement.getText());
}
Hope it helps.
In Ruby:
elements = driver.find_elements(:xpath, "//*[@class='divclass']//a")
list = elements.map { |e| { e.text => e.attribute("href") } }
This returns an array of hashes, each mapping a link's text to its href.
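To make the shape of that result concrete, here is a sketch that runs without a browser. FakeLink is a hypothetical stand-in for a Selenium element, and the texts and URLs are made-up sample values, not data from the page in the question.

```ruby
# Hypothetical stand-in for a Selenium element: it responds to
# #text and #attribute("href") so the mapping can be run offline.
FakeLink = Struct.new(:text, :href) do
  def attribute(name)
    name == "href" ? href : nil
  end
end

# Made-up sample data.
elements = [
  FakeLink.new("hello world 1", "https://example.com/1"),
  FakeLink.new("hello world 2", "https://example.com/2")
]

# Same shape as the answer: one single-pair hash per link.
list = elements.map { |e| { e.text => e.attribute("href") } }

# Alternative: a single hash keyed by link text (Ruby 2.6+).
links_by_text = elements.to_h { |e| [e.text, e.attribute("href")] }

p list
p links_by_text
```

The single-hash form is easier to look up by text, but it silently drops duplicates when two links share the same text, so the array of hashes is the safer default.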
Let's say I have a DOM like this:
<div id="tabsmenu">
<ul>
<li class="one"><a href="#foo" class="ui-tabs-anchor">foo</a></li>
<li class="two"><a href="#baz" class="ui-tabs-anchor">baz</a></li>
</ul>
</div>
and I would like to get the href values from the <a> elements:
# desired output: ['#foo', '#baz']
How can I do this with XPath, combining an id with an element of a specific class inside that id?
Already tried:
some_doc.xpath('//a[@id="tabsmenu"]/[@class="ui-tabs-anchor"]/@href')
# select all href attributes of any a element that is inside id tabsmenu and has the class attribute ui-tabs-anchor
EDIT - corrected tabmenu into tabsmenu
You're most likely looking for something like this:
//div[@id='tabsmenu']//a[@class='ui-tabs-anchor']/@href
That will get all href attributes that are part of an a tag with the class ui-tabs-anchor and inside a div element with the id tabsmenu.
Also you might want to take a look at this question:
Find out if class name contains certain text
This matters because the class test matches only the exact attribute value (ui-tabs-anchor). If an additional class is ever added, for example class="ui-tabs-anchor disabled", the expression will no longer match.
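The difference can be sketched in plain Ruby. The two helper names below are hypothetical; they only mirror the two comparison styles. On the XPath side, a commonly used token-aware test is contains(concat(' ', normalize-space(@class), ' '), ' ui-tabs-anchor ').

```ruby
# Mirrors an exact attribute test: the whole class attribute
# must equal the wanted name.
def exact_class_match?(class_attr, wanted)
  class_attr == wanted
end

# Mirrors a token-aware test: split the attribute on whitespace
# and look for the wanted name among the tokens.
def class_token_match?(class_attr, wanted)
  class_attr.split.include?(wanted)
end

["ui-tabs-anchor", "ui-tabs-anchor disabled", "ui-tabs-anchor-wide"].each do |attr|
  puts "#{attr.inspect}: exact=#{exact_class_match?(attr, 'ui-tabs-anchor')} " \
       "token=#{class_token_match?(attr, 'ui-tabs-anchor')}"
end
```

The third sample shows why a bare contains(@class, 'ui-tabs-anchor') is not enough either: it would also accept a different class that merely starts with the same text.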