I am trying to use Selenium. The problem is the following:
The doc structure:
<div class="jsSkills oSkills">
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-testing/" data-contractor="749244">software-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-qa-testing/" data-contractor="749244">software-qa-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/blog-writing/" data-contractor="749244">blog-writing</a>
</div>
I need to collect the text of every a element into an array, like:
{"software-testing", "software-qa-testing", "blog-writing"}
I tried this:
contrSkill = driver.find_element(:xpath, "//div[contains(@class, 'jsSkills')]").text
puts contrSkill
but got this:
"software-testingsoftware-qa-testingblog-writing"
Please explain how to build such an array properly.
You should get all of the link elements you want (using find_elements). Then you can iterate over each link and collect its text into an array (Ruby has a collect method that helps with this).
# Get all of the link elements within the div
skill_links = driver.find_elements(:xpath, "//div[contains(@class, 'jsSkills')]/a")
# Create an array of the text of each link
skill_text_array = skill_links.collect(&:text)
p skill_text_array
#=> ["software-testing", "software-qa-testing", "blog-writing"]
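To see what collect(&:text) does without starting a browser, here is a minimal sketch. FakeLink is a hypothetical stand-in for a Selenium element (it only responds to text); it is not part of selenium-webdriver.

```ruby
# Hypothetical stand-in for a Selenium element: it only responds to #text.
FakeLink = Struct.new(:text)

skill_links = [
  FakeLink.new("software-testing"),
  FakeLink.new("software-qa-testing"),
  FakeLink.new("blog-writing")
]

# collect is an alias of map; &:text calls #text on every element.
short_form = skill_links.collect(&:text)
long_form  = skill_links.map { |link| link.text }

p short_form
p short_form == long_form
```

Both forms build the same array, so which one to use is a matter of taste.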
I am working on a scraping project and I am stuck on one big problem: I can't get the "alt" text of the "img" tags.
The HTML looks like this:
<div class="example">
<span class="on">
<img src="https://www.~~~~~~~~" alt="hello">
</span>
<span class="on">
<img src="https://www.~~~~~~~~" alt="goodbye">
</span>
<span class="on">
<img src="https://www.~~~~~~~~" alt="konichiwa">
</span>
</div>
This is what I have tried:
def fetch_text_in_on_class
  # @driver.find_elements(:class_name, 'on')[2].text or this ↓
  # @driver.find_elements(:css, 'div.pc-only:nth-of-type(3) tr:nth-of-type(3)').first.text
end
I also tried something like this:
def fetch_text_in_on_class
  e = @driver.find_elements(:class => 'on').first&.attribute("alt")
  e
end
There are many elements with the "on" class on the page, and I want the alt text from all of them.
Apparently I can get the elements that have the "on" class with the code below, but I can't get the alt text.
@driver.find_elements(:class => 'on')
I would really appreciate if you could help me.
Thank you.
Forgive me if my Ruby syntax is incorrect or I'm not answering your actual question. You want the alt text itself? You could collect the img elements inside the spans with class "on" as an array, then loop through them to retrieve each alt text. So, something like this?
elements = @driver.find_elements(:css => 'span.on > img')
elements.each { |element|
  altText = element.attribute("alt")
  # whatever you want to do with the alt text, or store it in an array declared above, etc.
}
It looks like the problem is that you are trying to get the alt text from the element with the class on. Given your posted HTML that element doesn't have the alt attribute. Try the CSS selector span.on > img to get the IMG tag and then get the alt text. An updated version of your code to get the text of the first element should work.
e = @driver.find_elements(:css => 'span.on > img').first&.attribute("alt")
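The &. in that line matters because find_elements returns an empty array when nothing matches, so first can be nil. A small sketch of that behaviour, using FakeImage as a hypothetical stand-in for an img element (real Selenium elements are not built this way):

```ruby
# Hypothetical stand-in for an <img> element; it only responds to #attribute.
FakeImage = Struct.new(:attrs) do
  def attribute(name)
    attrs[name]
  end
end

found = [FakeImage.new({ "alt" => "hello" }), FakeImage.new({ "alt" => "goodbye" })]
not_found = [] # what an unmatched selector would give back

# &. only calls #attribute when the receiver is not nil.
first_alt = found.first&.attribute("alt")
missing_alt = not_found.first&.attribute("alt")

p first_alt   # the alt text of the first image
p missing_alt # nil instead of a NoMethodError
```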
Let's iterate over the collection, save each alt text in an array, and then print it:
elements = driver.find_elements(:css => 'span.on > img').map { |element| element.attribute("alt") }
p elements
Output
["hello", "goodbye", "konichiwa"]
I have a variable e that stores a Nokogiri::XML::Element object.
When I execute puts e, I get the following on the screen:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part: https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri?
If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.
Let's say I want to scrape the "Weight" attribute from the following content on a website:
<div>
<h2>Details</h2>
<ul>
<li><b>Height:</b>6 ft</li>
<li><b>Weight:</b>6 kg</li>
<li><b>Age:</b>6</li>
</ul>
</div>
All I want is "6 kg". But it's not labeled, and neither is anything around it. But I know that I always want the text after "Weight:". Is there a way of selecting an element based on the text near it or in it?
In pseudocode, this is what it might look like:
require 'selenium-webdriver'
require 'nokogiri'
doc = parsed document
div_of_interest = doc.div where text of h2 == "Details"
element_of_interest = <li> element in div_of_interest with content that contains the string "Weight:"
selected_text = (content in element) minus ("<b>Weight:</b>")
Is this possible?
You can write the following code
p driver.find_elements(xpath: "//li").detect{|li| li.text.include?'Weight'}.text[/:(.*)/,1]
output
"6 kg"
My suggestion is to use Watir, which is a wrapper around the Ruby Selenium bindings, where you can easily write the following code:
p b.li(text: /Weight/).text[/:(.*)/,1]
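The [/:(.*)/, 1] part of both snippets is plain Ruby string indexing with a regex and a capture-group number. A minimal sketch, assuming the element's text comes back as "Weight:6 kg" (the exact whitespace depends on the page):

```ruby
# Assumed return value of the li's text for <li><b>Weight:</b>6 kg</li>
li_text = "Weight:6 kg"

# String#[] with a regex and an index returns that capture group,
# or nil when the regex does not match at all.
weight = li_text[/:(.*)/, 1]
p weight

no_colon = "6 kg"
no_match = no_colon[/:(.*)/, 1]
p no_match
```

If the page puts a space after the colon, calling strip on the result removes it.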
Yes.
require 'nokogiri'
Nokogiri::HTML.parse(File.read(path_to_file))
  .css("div > ul > li")
  .children # get the child nodes of each 'li' item
  .each_slice(2) # pair a 'b' item and the text following it
  .find{|b, text| b.text == "Weight:"}
  .last # extract the text element
  .text
will return
"6 kg"
You can locate the element with pure XPath: the contains() function returns a Boolean indicating whether its second argument is found in the first. Pass it the string value of the li (.) and the target string. Using text() alone would not match here, because "Weight:" sits inside the child b element rather than in the li's own text node.
xpath_locator = '//div/ul/li[contains(., "Weight:")]'
value = driver.find_element(:xpath, xpath_locator).text.partition('Weight:').last
Then just get the value after "Weight:".
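For reference, this is how partition splits the string. The sketch assumes the element's text is "Weight:6 kg"; the real value depends on the page.

```ruby
# Assumed text of the matched li element.
li_text = "Weight:6 kg"

# partition returns [text before, the separator, text after].
before, separator, after = li_text.partition("Weight:")
p [before, separator, after]
p after.strip

# When the separator is absent, the last two parts are empty strings.
missing = "Age:6".partition("Weight:")
p missing
```

Because a missing label yields an empty string rather than nil, an empty result is worth checking for before using the value.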
Hello, I want to ask a question.
I scraped a website with XPath, and the result looks like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to apply XPath to these results again.
I want to save the first cell as the address, the second as the telephone number, and the last one as the map,
but I can't get it to work.
Please guide me. Thank you!
Here is my code. It is wrong: it catches other things as well.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As you can see in the Scrapy documentation, to work with relative XPaths you have to use the .// notation to extract elements relative to the previous XPath. Otherwise you get all matching elements from the whole document again. The Scrapy documentation gives this sample:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case your code must be something like:
for s in store:
    address = s.xpath(".//tr/td[1]/text()").extract()
    tel = s.xpath(".//tr/td[2]/text()").extract()
    map = s.xpath(".//tr/td[3]/text()").extract()
Hope this helps.
We have page object elements like:
link(:test_link, xpath: ".//a[@id = '3']")
unordered_list(:list, id: 'test')
And the code:
def method(elementcontainer, elementlink)
  elementcontainer = elementcontainer.downcase.gsub(' ', '_')
  elementlink = elementlink.downcase.gsub(' ', '_')
  object = send("#{elementcontainer}_element")
  object2 = send("#{elementlink}_element")
  total_results_1 = object.element.links(id: '3').length
  total_results_2 = object.element.links(object2).length
end
The last 2 lines contain the mystery.
The total_results_1 is able to get the number of links contained in the unordered list that have id = '3'.
total_results_2 does not work (of course). I don't want to repeat the identification of the links in the middle of the code; that is already done in the page object.
How is it possible to write something like the total_results_2 line, but in a working version?
I might be misunderstanding the question, but I do not believe you need to create a method for what you want. It can all be done using the page object accessors.
Say we have the following page (I matched this to your accessors, though it seems unlikely that all links would have the same id):
<html>
<body>
<a id="3" href="#">1</a>
<ul id="test">
<li><a id="3" href="#">2</a></li>
<li><a id="3" href="#">3</a></li>
<li><a id="3" href="#">4</a></li>
</ul>
<a id="3" href="#">5</a>
</body>
</html>
As you did, you could define the list with the accessor:
unordered_list(:list, id: 'test')
To get the links with id 3 that are within the list only, you could:
Define the links as a collection, i.e. use links instead of link.
Use a block to locate the elements. This allows you to take the element nesting into account, i.e. locate links within the list element.
This would be done with:
links(:test_link){ list_element.link_elements(:id => '3') }
All together, your page object would be:
class MyPage
  include PageObject
  unordered_list(:list, id: 'test')
  links(:test_link){ list_element.link_elements(:id => '3') }
end
To find the number of links, you would access the element collection and check its length.
browser = Watir::Browser.new
browser.goto('your_test_page.htm')
page = MyPage.new(browser)
puts page.test_link_elements.length
#=> 3