Nokogiri next_element with filter

Let's say I've got an ill-formed HTML page:
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
With BeautifulSoup, we were able to get the <th> and then call findNext("td"). Nokogiri has the next_element call, but that might not return what I want (in this case, it would return the tr element).
Is there a way to filter Nokogiri's next_element call, e.g. next_element("td")?
EDIT
For clarification, I'll be looking at many sites, most of them ill-formed in different ways.
For instance, the next site might be:
<table>
<th class="what_I_need">Super sweet text<th>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
I can't assume any structure other than that there will be trs below the item that has the class what_I_need.

First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.
One way to do it is to use XPath to navigate to it once you've found the th node:
require 'nokogiri'
html = '
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
</tr>
</table>
'
doc = Nokogiri::HTML(html)
th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n I also need this\n "
This is taking advantage of Nokogiri's ability to use either CSS accessors or XPath, and to do it pretty transparently.
Once you have the <th> node, you could also navigate using some of Node's methods:
th.parent.next_element.at('td').text # => "\n I also need this\n "
One more way to go about it is to start at the top of the table and look down:
table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n I also need this\n "
If you need to access all <td> tags within a table you can iterate over them easily:
table.search('td').each do |td|
# do something with the td...
puts td.text
end
If you want the contents of all <td> tags grouped by their containing <tr>, iterate over the rows and then over the cells:
table.search('tr').each do |tr|
cells = tr.search('td').map(&:text)
# do something with all the cells
end

Related

how to exclude a table inside in another table in xpath?

I have the following HTML file:
<table class="pd-table">
<caption> Tech </caption>
<tbody>
<tr data-group="1">
<td> Electrical </td>
<td> Design </td>
<tr data-group="1">
<td> Output </td>
<td> Function </td>
<tr data-group="7">
<td> EMC </td>
<table>
<tbody>
<tr>
<td> EN 6547 ESD </td>
<td> EN 8901 ESD </td>
<tr data-group="8">
<td> Weight [8] </td>
<td> 27.7 </td>
I can isolate EN 6547 ESD and EN 8901 ESD with the following XPath:
response.xpath('//table[@class="pd-table"]//tbody//tr//td/table//tr//td/text()').getall()
Any other way is always welcome :)
I would also like to get all the remaining data, excluding the values isolated above.
Is there any way to do it? :)

It looks like the table tag is not closed properly in data-group 7...
Anyway, in such cases you can rely on the text content of the cell, using contains() or text()="some exact text":
response.xpath('//td[contains(text(), "EMC")]').css('td~table tbody td::text').extract()

The XPath you used contains a lot of unnecessary double slashes.
See the meaning of the double slash in XPath.
The fewer double slashes you use, the better it will perform.
So just use single slashes, like this:
//table[@class="pd-table"]/tbody/tr/td/table/tr/td/text()
Another way is to select the td's that have two table ancestors:
//td[count(ancestor::table)=2]/text()
And that leads to the answer to your second question:
//td[count(ancestor::table)=1]/text()
Another possibility would be:
//table[@class="pd-table"]/tbody/tr/td/text()
Or (assuming the second table does not have tr's with @data-group):
//tr[@data-group]/td/text()
So you see, many XPaths lead to Rome ;-).

XPath find text according last word in the string

I need to find the whole text of a cell based on the last word in the string. I have something like this:
<table>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind2</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind3</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
</table>
I need to find the whole text value based on its last word, texttofind:
<td>text text texttofind</td>
I can't use contains, because it would match multiple values. I need something like ends-with, but I am using XPath 1.0.
I tried something like this, but it is not working and I am not sure what is wrong:
//tr[substring(., string-length(@td)
- string-length('texttofind') + 1) = 'texttofind']
Or maybe it would be better to use matches?

You're almost there; try changing your xpath expression to
//tr//td[substring(., string-length(.)
- string-length('texttofind') + 1) = 'texttofind']
and see if it works.

Scraping page with correct xpath using Mechanize and nokogiri

I am trying to access data contained in a table that is itself contained in a table with class="L1".
So basically my HTML structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...ect...ect
</table>
</table>
I need to capture the data contained in all the <a> </a> tags that are in the second <td> of each <tr> </tr>, but only starting with the second <tr> of the table.
So far I came up with that:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start only at the second <tr> (second <tr> included).
What would be the right code to do this?

You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td[2]/a[1]")
As the comments on that SO answer say, remember XPath counts from 1, so >1 skips the first tr.

How to select elements when there is a space in an HTML class

How do I use CSS selectors to get the "This is the text I need" line below?
I don't know how to deal with spaces in the table class.
<table class="some name">
<thead>
</thead>
<tbody>
<tr>
<td style="text-align:center;">50</td>
<td style="text-align:left;">This is the text I need</td>

If there are spaces in the class attribute value, that means there are multiple classes applied to the element. To locate an element with multiple classes, the css selector is just a chain of the classes. Generally, the form looks like:
element.class1.class2
Therefore, assuming the cell you need is the second td in the first row of the table with classes "some" and "name", you can do:
require 'nokogiri'
html = %Q{
<table class="some name">
<thead>
</thead>
<tbody>
<tr>
<td style="text-align:center;">50</td>
<td style="text-align:left;">This is the text I need</td>
</tr>
</tbody>
</table>
}
doc = Nokogiri::XML(html)
# Assuming you need both classes to uniquely identify the table
p doc.at_css('table.some.name td:nth-child(2)').text
#=> "This is the text I need"
# Note that you do not need to use both classes if one of them is unique
p doc.at_css('table.name td:nth-child(2)').text
#=> "This is the text I need"

Webdriver in Ruby, how to check elements exist

I am using Webdriver in Ruby and I want to verify that three texts exist on a page.
Here is the piece of html I want to verify:
<table class="c1">
<thead>many subtags</thead>
<tbody id="id1">
<tr class="c2">
<td class="first-child">
<span>test1</span>
</td>
many other <td></td>
</tr>
<tr class="c2">
<td class="first-child">
<span>test2</span>
</td>
many other <td></td>
</tr>
<tr class="c2">
<td class="first-child">
<span>test3</span>
</td>
many other <td></td>
</tr>
</tbody>
</table>
How do I verify that "test1", "test2" and "test3" are present on this page using
find_element
find_elements
getPageSource?
What is the best approach for it?
Thank you very much.

I would go with the #find_elements method, because with the other two options there is a chance of getting a "no such element" exception.
First collect the matches in an array:
array = driver.find_elements(:xpath, "//table[@class='c1']//tr[@class='c2']//span[text()='test1']")
Now check whether the array is empty:
puts "test1 is present" unless array.empty?
We can test for test2 and test3 in the same way.

The following sample code should help you perform the task:
def check_element_exists
  arr = ["test1", "test2", "test3"]
  arr.each do |a|
    # page_source is the full HTML of the current page as a string
    if $driver.page_source.include? a
      puts a
    else
      print a
      puts " Not present"
    end
  end
end
