How to scrape the href using Nokogiri - ruby

I have a variable e which stores a Nokogiri::XML::Element object.
When I execute puts e, I get the following on the screen:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part: https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri?

If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
when the condition normalize-space() = "Text" is omitted is a clue.
normalize-space() returns "... Text Here ..." for the first li
in the posted XML and "... Text ..." for the second (or some other
content in place of ... from svg or div/small), which causes
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
data-testid="menu-item" matches both outer li elements, while the text content you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and its inner strong element's text value, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested both expressions and they work correctly.

Can't get contents from XPath in Codeception

I have an element:
<a href="/s-xQ6qeR/documents/download?revid=28">
<span class="icon icon-file-pdf-o" style="vertical-align: middle"></span> test_upload_uwfacjtn.pdf
</a>
I need to check that this element is on the page.
I tried this:
$fileHref = $this->I->grabAttributeFrom("//a[contains(., 'test_upload_uwfacjtn.pdf')]", 'href');
But I got an error:
Step Grab attribute from "//a[contains(.,
'test_upload_uwfacjtn.pdf')]","href" Fail Element that matches CSS
or XPath element with '//a[contains(., 'test_upload_uwfacjtn.pdf')]'
was not found.
I found two ways to check the text inside an HTML tag:
1. Using the grabTextFrom method and then comparing the result
$fileName = $I->grabTextFrom('//a[@href="/s-xQ6qeR/documents/download?revid=28"]/span');
$I->assertEquals('test_upload_uwfacjtn.pdf', $fileName);
This can be useful if you want to put the result inside a variable and use it for other tests later.
2. Using the seeElement method with the text to compare inside your XPath
$I->seeElement('//span[text()="test_upload_uwfacjtn.pdf"]');

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub-elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
1. Since you know the node you want to ignore, just subtract that text from the whole text:
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
2. Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL):
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
# do whatever you want with the nokogiri fragment

Xpath get text of nested item not working but css does

I'm making a crawler with Scrapy and wondering why my xpath doesn't work when my CSS selector does? I want to get the number of commits from this html:
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
Xpath:
response.xpath('//li[@class="commits"]//a//span[@class="text-emphasized"]//text()').extract()
CSS:
response.css('li.commits a span.text-emphasized').css('::text').extract()
CSS returns the number (unescaped), but XPath returns nothing. Am I using the // for nested elements correctly?
Your XPath compares the whole class attribute of the span against text-emphasized, but the attribute value is num text-emphasized, so nothing matches. Use the contains function to test only for text-emphasized:
response.xpath('//li[@class="commits"]//a//span[contains(@class, "text-emphasized")]//text()').extract()[0].strip()
Otherwise, also include num:
response.xpath('//li[@class="commits"]//a//span[@class="num text-emphasized"]//text()').extract()[0].strip()
Also, I use [0] to retrieve the first extracted string and strip() to remove the surrounding whitespace, resulting in just the number.

Put the Xpath element's text to array

I am trying to use Selenium. The problem is the following:
The doc structure:
<div class="jsSkills oSkills">
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-testing/" data-contractor="749244">software-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-qa-testing/" data-contractor="749244">software-qa-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/blog-writing/" data-contractor="749244">blog-writing</a>
</div>
I need to obtain all a's text to be in array like:
{"software-testing", "software-qa-testing", "blog-writing"}
I tried this:
contrSkill = driver.find_element(:xpath, "//div[contains(@class, 'jsSkills')]").text
puts contrSkill
but got this:
"software-testingsoftware-qa-testingblog-writing"
Please explain how to appropriately make an array.
You should get all of the link elements you want (using find_elements). Then you can iterate over each link and collect its text into an array (Ruby has a collect method that helps with this).
# Get all of the link elements within the div
skill_links = driver.find_elements(:xpath, "//div[contains(@class, 'jsSkills')]/a")
# Create an array of the text of each link
skill_text_array = skill_links.collect(&:text)
p skill_text_array
#=> ["software-testing", "software-qa-testing", "blog-writing"]
