XPath: get a node's sibling's string content

Assume the following HTML DOM:
<table id="my_id">
<tbody>
<tr>
<th>Location:</th>
<td>
1600 Parkway Ave
Los Angels
California
</td>
</tr>
</tbody>
</table>
How do I get '1600 Parkway Ave Los Angels California' assuming there are a lot of <tr> in this table? I think I need to get the sibling of the <th> that contains Location:. I've been trying to do something like:
//*[@id="my_id"]//th[Text()='Location:']

You could move the th predicate up to tr...
//*[@id="my_id"]//tr[th='Location:']/td/text()
Update, based on a comment that the "Location:" string actually has whitespace around it:
//*[@id="my_id"]//tr[normalize-space(th)='Location:']/td/text()
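The same idea, sketched with Python's standard-library ElementTree. This is only an illustration under some assumptions: the markup is a well-formed copy of the table from the question, the extra "Name:" row is made up, and ElementTree implements only a subset of XPath (it accepts tr[th='Location:'] but not text() or normalize-space()), so the whitespace is collapsed in Python instead.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the table from the question.
# The "Name:" row is hypothetical, added to show that other rows are skipped.
HTML = """
<table id="my_id">
  <tbody>
    <tr><th>Name:</th><td>Example Corp</td></tr>
    <tr>
      <th>Location:</th>
      <td>
        1600 Parkway Ave
        Los Angels
        California
      </td>
    </tr>
  </tbody>
</table>
"""


def location_text(markup: str) -> str:
    """Return the text of the td that shares a row with the 'Location:' th."""
    root = ET.fromstring(markup)
    # The predicate sits on the tr, so the step after it selects the sibling td.
    td = root.find(".//tr[th='Location:']/td")
    if td is None or td.text is None:
        return ""
    # Collapse runs of whitespace, much like normalize-space() would.
    return " ".join(td.text.split())


print(location_text(HTML))
```

With a full XPath engine the expressions above can be used directly; the Python-side whitespace handling is only needed because of ElementTree's limited XPath support.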

As simple as:
//*[@id="my_id"]//th[text()='Location:']/../td/text()

Related

How to exclude a table nested inside another table in XPath?

I have the following HTML file:
<table class="pd-table">
<caption> Tech </caption>
<tbody>
<tr data-group="1">
<td> Electrical </td>
<td> Design </td>
<tr data-group="1">
<td> Output </td>
<td> Function </td>
<tr data-group="7">
<td> EMC </td>
<table>
<tbody>
<tr>
<td> EN 6547 ESD </td>
<td> EN 8901 ESD </td>
<tr data-group="8">
<td> Weight [8] </td>
<td> 27.7 </td>
I can isolate EN 6547 ESD and EN 8901 ESD with the following XPath:
response.xpath('//table[@class="pd-table"]//tbody//tr//td/table//tr//td/text()').getall()
Any other way of doing this is welcome :)
I would also like to get all the rest of the data, without the values isolated above.
Is there any way to do it? :)

It looks like the table tag is not closed properly in the data-group="7" row...
Anyway, in such cases you can anchor on the text content of the cell using contains() or text()="some exact text":
response.xpath('//td[contains(text(), "EMC")]').css('td~table tbody td::text').extract()

Your XPath uses a lot of unnecessary double slashes.
See the meaning of the double slash in XPath.
The fewer double slashes you use, the better the expression tends to perform.
So use single slashes where you can, like this:
//table[@class="pd-table"]/tbody/tr/td/table/tbody/tr/td/text()
Another way is to select the td elements that have two table ancestors:
//td[count(ancestor::table)=2]/text()
And that leads to the answer to your second question:
//td[count(ancestor::table)=1]/text()
Another possibility would be:
//table[@class="pd-table"]/tbody/tr/td/text()
Or (assuming the second table does not have tr elements with @data-group):
//tr[@data-group]/td/text()
So you see, many XPaths lead to Rome ;-).
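To illustrate the count(ancestor::table) idea, here is a minimal Python sketch. It makes several assumptions: the markup is a hypothetical, well-formed version of the tables from the question (with the inner table placed inside a td), and because the standard-library ElementTree has no ancestor axis, the number of table ancestors is tracked by a plain recursive walk rather than by XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the nested tables from the question.
HTML = """
<table class="pd-table">
  <tbody>
    <tr data-group="1"><td>Electrical</td><td>Design</td></tr>
    <tr data-group="7">
      <td>EMC</td>
      <td>
        <table>
          <tbody>
            <tr><td>EN 6547 ESD</td><td>EN 8901 ESD</td></tr>
          </tbody>
        </table>
      </td>
    </tr>
    <tr data-group="8"><td>Weight [8]</td><td>27.7</td></tr>
  </tbody>
</table>
"""


def td_texts_by_table_depth(markup: str) -> dict:
    """Group non-empty td texts by how many table elements enclose each td."""
    groups = {}

    def walk(elem, depth):
        if elem.tag == "table":
            depth += 1  # everything below this element has one more table ancestor
        if elem.tag == "td":
            text = (elem.text or "").strip()
            if text:
                groups.setdefault(depth, []).append(text)
        for child in elem:
            walk(child, depth)

    walk(ET.fromstring(markup), 0)
    return groups


groups = td_texts_by_table_depth(HTML)
print(groups[2])  # cells of the nested table
print(groups[1])  # cells that belong only to the outer table
```

Depth 2 corresponds to //td[count(ancestor::table)=2] and depth 1 to //td[count(ancestor::table)=1]; the td that merely wraps the inner table has no text of its own and is skipped.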

Scraping page with correct xpath using Mechanize and nokogiri

I am trying to access data contained in a table that is itself contained in a table with class ='L1'.
So basically my html structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...ect...ect
</table>
</table>
I need to capture the data in every <a> </a> inside the second <td> of each <tr> </tr>, but only starting with the second <tr> of the table.
So far I have come up with this:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start only from the second <tr> (second <tr> included).
What would be the right code to do this?

You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td[2]/a[1]")
Remember that XPath counts from 1, so position()>1 skips only the first tr.
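The off-by-one point can be seen in a small Python sketch. It is only an illustration: the table, link texts, and href values below are made up, and because the standard-library ElementTree does not support position()>1, a list slice plays that role instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; the link texts and href values are made up for illustration.
HTML = """
<table>
  <tr><td>header</td><td>header</td></tr>
  <tr><td></td><td><a href="/a">first</a></td></tr>
  <tr><td></td><td><a href="/b">second</a></td></tr>
</table>
"""


def links_after_first_row(markup: str) -> list:
    """Return the link text from the second td of every row except the first."""
    table = ET.fromstring(markup)
    rows = table.findall("tr")
    links = []
    # rows[1:] stands in for tr[position()>1]: Python lists are 0-based and
    # XPath positions are 1-based, so both skip exactly the first row.
    for row in rows[1:]:
        link = row.find("td[2]/a")  # td[2] is the second td, as in XPath
        if link is not None:
            links.append(link.text)
    return links


print(links_after_first_row(HTML))
```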

XPath for a tr that has multiple texts inside it

I need an XPath for a <tr> that contains the text 'abc' in its second <td> and the text 'xyz' in its third <td>.
I tried the following, but had no luck:

final String XPATH = "//tr[td[contains(.,'abc')] and td[contains(.,'xyz')]";
Your expression actually almost selects what you want (once you fix the missing last ]). You just need to specify positions of the <td> elements.
//tr[td[2][contains(.,'abc')] and td[3][contains(.,'xyz')]]
For the following XML document:
<document>
<table>
<tr>
<td>foo</td>
<td>abc</td>
<td>xyz</td>
</tr>
<tr>
<td>foo</td>
<td>bar</td>
<td>xyz</td>
</tr>
</table>
</document>
this returns a node-set with the first <tr> element of the document in document order.
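As a cross-check, the same condition can be written out in plain Python. This is a sketch, not the XPath itself: the standard-library ElementTree has no contains(), so the two positional tests are spelled out with ordinary substring checks on the sample document above.

```python
import xml.etree.ElementTree as ET

# The sample document from the answer above.
XML = """
<document>
  <table>
    <tr><td>foo</td><td>abc</td><td>xyz</td></tr>
    <tr><td>foo</td><td>bar</td><td>xyz</td></tr>
  </table>
</document>
"""


def matching_rows(markup: str, second: str, third: str) -> list:
    """Return tr elements whose 2nd td contains `second` and 3rd td contains `third`."""
    root = ET.fromstring(markup)
    matches = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        if len(cells) < 3:
            continue
        # cells[1] and cells[2] correspond to td[2] and td[3] in XPath.
        second_text = "".join(cells[1].itertext())
        third_text = "".join(cells[2].itertext())
        if second in second_text and third in third_text:
            matches.append(tr)
    return matches


rows = matching_rows(XML, "abc", "xyz")
print([td.text for td in rows[0]])
```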

Import data from HTML page using feeds importer in drupal

I'm trying to import some data from a HTML page with feeds importer. The context is this:
<table class="tabela">
<tr valign="TOP">
<td class="formulario-legenda">Nome:</td>
<td nowrap="nowrap">
<b>Raul Fernando de Almeida Moreira Vidal</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Sigla:</td>
<td>
<b>RMV</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Código:</td>
<td>206415</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Estado:</td>
<td>Ativo</td>
</tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr>
<td class="formulario-legenda">Categoria:</td>
<td>Professor Associado</td>
</tr>
<tr>
<td class="formulario-legenda">Carreira:</td>
<td>Pessoal Docente de Universidades</td>
</tr>
<tr>
<td class="formulario-legenda">Grupo profissional:</td>
<td>Docente</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Departamento:</td>
<td>
<a href="uni_geral.unidade_view?pv_unidade=151"
title="Departamento de Engenharia Informática">Departamento de Engenharia Informática</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
I tried with this:
/html/body/div/div/div/div/div/div/div/table/tbody/tr/td/table/tbody/tr[1]/td[2]
but nothing appears. Can someone help me with the right syntax to obtain "Grupo Profissional"?

Quick answer that might work
Considering just the HTML sample you provided (which only has two tables) you can select the text you want using this expression, based on the table's position:
//table[2]//tr[3]/td[1]/text()
This will work in the HTML you pasted above. But it might not work in your actual scenario, since you might have other tables, the table you want to select has no ID and you didn't suggest some invariant text in your code which could be used to anchor the context for the expression. Assuming the initial part of your XPath expression (the div sequence) is correct, you might be able to use:
/html/body/div/div/div/div/div/div/div/table[2]//tr[3]/td[1]/text()
But it's quite a fragile expression and vulnerable to any changes in the document.
A (possibly) better solution
A better alternative is to look for some identifier you could use. I can only guess, since I don't know your code. In your sample code, I would guess that Código and the number following it, 206415, might be some identifier. If it is, you could use it to anchor your context. First you select it:
//table[.//td[text()='Código:']/following-sibling::td='206415']
The expression above will select the table which contains a td with the exact text Código: followed by a td containing the exact text 206415. This will create a unique context (assuming that the number is a unique identifier). From that context, you can now select the text you want, which is inside the next table (following-sibling::table[1]). This is the context of the second table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]
And this should select the text you want (Grupo profissional:) which is in the third row tr[3] and first cell/column td[1] of that table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]//tr[3]/td[1]/text()

xpath - how to find info

Given this HTML:
<tr class="even" id="district_22">
<td class="name">Virginia Beach City Public Schools</td>
<td class="">Delete</td>
</tr>
<tr class="even" id="district_23">
<td class="name">Virginia City City Public Schools</td>
<td class="">Delete</td>
</tr>
<tr class="even" id="district_24">
<td class="name">Virginia Town City Public Schools</td>
<td class="">Delete</td>
</tr>
I am trying to use Selenium and xpath with it.
I am having problems when trying to select the 'delete' link that belongs to 'Virginia Beach City Public Schools'.
I am new to xpath.
I am trying:
xpath=(//td[text()='Beach')]/@class.contains('delete'))
but it is not finding the element.
Note: I cannot use the ID as these are repeated tests and the ID changes each time.

Try this:
//td[contains(text(),'Beach')]/../td/a[contains(@class,'delete_link')]
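A minimal Python sketch of the same navigation (find the cell by its text, step up to the row, then find the link in that row). Several things here are assumptions: the question's HTML shows only the text "Delete", so the <a class="delete_link"> elements and their href values are guessed from the XPath above, and the standard-library ElementTree is used only to show the logic, not to drive a browser.

```python
import xml.etree.ElementTree as ET

# Rows from the question, wrapped in a table. The <a class="delete_link"> elements
# are an assumption based on the XPath in this answer; href values are made up.
HTML = """
<table>
  <tr class="even" id="district_22">
    <td class="name">Virginia Beach City Public Schools</td>
    <td class=""><a class="delete_link" href="/delete/a">Delete</a></td>
  </tr>
  <tr class="even" id="district_23">
    <td class="name">Virginia City City Public Schools</td>
    <td class=""><a class="delete_link" href="/delete/b">Delete</a></td>
  </tr>
</table>
"""


def find_delete_link(markup: str, name_fragment: str):
    """Return (row id, link element) for the first row whose name contains the fragment."""
    root = ET.fromstring(markup)
    for row in root.iter("tr"):
        name_cell = row.find("td[@class='name']")
        if name_cell is None or name_fragment not in (name_cell.text or ""):
            continue
        # Same row, so any link found here belongs to the matched district.
        for link in row.iter("a"):
            if "delete_link" in link.get("class", ""):
                return row.get("id"), link
    return None


row_id, link = find_delete_link(HTML, "Beach")
print(row_id, link.text)
```

Matching on the name and then staying inside that row means the result does not depend on the row id, which the question says changes between test runs.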

//tr[@id="district_22"]//a[contains(@class,'delete_link')] would be a lot better.
It's not good to rely on the text. After all, it may get localized or edited in other ways. IDs, however, are meant to be unchanging and unique.

I think you want to run an automated script in a loop. If that is the case, you can try the code below:
for (int i = 1, dist = 22; i <= count; i++, dist++)
{
....
....
driver.findElement(By.xpath("//*[@id='district_" + dist + "']/..."));
}
