Find all preceding sibling nodes until one is found with a specific child node attribute - xpath

I would like to get all table rows after a specific row identifier (an attribute on the row column) until that specific row identifier is found.
Here is the html I'm trying to parse:
<tr>
<td colspan="4">
<h3>Header 1</h3>
</td>
</tr>
<tr>
<td>Item desc - Header 1</td>
<td>more info</td>
<td>30</td>
<td>500</td>
</tr>
<tr>
<td colspan="4">
<h3>Header 2</h3>
</td>
</tr>
<tr>
<td>Item desc - header 2</td>
<td>other</td>
<td>4</td>
<td>49</td>
</tr>
<tr>
<td>Item 2 desc - header 2</td>
<td>other 2</td>
<td>65</td>
<td>87</td>
</tr>
I want to be able to grab the item under header 1 and stop when it finds header 2; then the items under header 2 and stop when it finds a header 3; etc.
Is this possible under xpath? I can't get it to only find the TR nodes until it finds a child node with a specific attribute (of colspan="4").

This is not possible in pure XPath 1.0. You have to pin down the header tr somehow, because you are looking for all its following siblings whose nearest preceding header tr is that same header, and XPath 1.0 on its own gives you no way to refer back to the original header inside the predicate. But you are probably working in some host language that can hold the header in a variable.
For example, in xsh:
for my $x in //tr[td/@colspan="4"] {
echo ($x/td/h3) ;
for $x/following-sibling::tr[count(td)=4
and preceding-sibling::tr[count(td)=1][1]=$x]
echo " " (td) ;
}
Output:
Header 1
Item desc - Header 1 more info 30 500
Header 2
Item desc - header 2 other 4 49
Item 2 desc - header 2 other 2 65 87
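The same "remember the current header" idea can be sketched in Python. This is only an illustration, assuming the rows are wrapped in a <table> element and parse as well-formed XML; it uses the standard library's ElementTree, which supports only a small XPath subset (no following-sibling axis), so the grouping is done in a plain loop. The function name group_rows_by_header is made up for this example.

```python
import xml.etree.ElementTree as ET

# Rows from the question, wrapped in <table> (an assumption) so they parse as one document.
HTML = """
<table>
<tr><td colspan="4"><h3>Header 1</h3></td></tr>
<tr><td>Item desc - Header 1</td><td>more info</td><td>30</td><td>500</td></tr>
<tr><td colspan="4"><h3>Header 2</h3></td></tr>
<tr><td>Item desc - header 2</td><td>other</td><td>4</td><td>49</td></tr>
<tr><td>Item 2 desc - header 2</td><td>other 2</td><td>65</td><td>87</td></tr>
</table>
"""

def group_rows_by_header(table):
    """Return {header text: [list of cell texts per row]} in document order."""
    groups = {}
    current = None
    for tr in table.findall("tr"):
        header_cell = tr.find("td[@colspan='4']")
        if header_cell is not None:
            # A header row starts a new group; later rows belong to it.
            current = "".join(header_cell.itertext()).strip()
            groups[current] = []
        elif current is not None:
            groups[current].append([td.text for td in tr.findall("td")])
    return groups

groups = group_rows_by_header(ET.fromstring(HTML))
for header, rows in groups.items():
    print(header)
    for row in rows:
        print(" ", " ".join(row))
```

Rows that appear before the first header are skipped in this sketch; whether that is the right behaviour depends on the real page.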

This might give you what you're looking for, not the most orthodox means though:
//*/tr/td[not(child::h3)]/ancestor::tr
This returns every <tr> that contains a <td> without an <h3> child, i.e. every row that isn't a header block.
You can also exclude just one specific header with:
//*/tr/td[not(child::h3/text()='Header 1')]/ancestor::tr
Or a more general:
//*/tr/td[not(child::h3[contains(text(),'Header')])]/ancestor::tr

Related

Xpath: Wildcards for descendant nodes not working

Desired output: 3333
<tbody>
<tr>
<td class="name">
<p class="desc">Intel</p>
</td>
</tr>
Other tr tags
<tr>
<td class="tel">
<p class="desc">3333</p>
</td>
</tr>
</tbody>
I want to select the last tr tag after the tr tag that has "Intel" in the p tag
//tbody//tr[td[p[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
The above works, but I don't wish to reference td and p explicitly. I tried the wildcards ? and *, but they don't work.
//tbody//tr[?[?[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
"...which contains a text node equal to 'Intel'"
//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()
"...which contains only the string 'Intel', once you remove all insignificant white-space"
//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()
I think the key take-away here is that you can use descendant paths (//) and pay attention to context in predicates once you make them relative (.//).
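To make the "any descendant text" idea concrete, here is a rough Python sketch using only the standard library. ElementTree cannot evaluate .//text() or following-sibling::, so the two steps are imitated with itertext() and list slicing. The middle row (class "other", value 1111) stands in for the "Other tr tags" placeholder and is invented, as is the helper name last_row_text_after.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the middle row is a made-up stand-in for "Other tr tags".
HTML = """
<tbody>
<tr><td class="name"><p class="desc">Intel</p></td></tr>
<tr><td class="other"><p class="desc">1111</p></td></tr>
<tr><td class="tel"><p class="desc">3333</p></td></tr>
</tbody>
"""

def last_row_text_after(tbody, needle):
    """Text of the last <tr> following the first <tr> that contains `needle`."""
    rows = tbody.findall("tr")
    for i, tr in enumerate(rows):
        # Close to tr[.//text() = needle], except that whitespace is stripped
        # before comparing: any descendant text node may match, at any depth.
        if any(t.strip() == needle for t in tr.itertext()):
            following = rows[i + 1:]
            if following:
                # Mirrors following-sibling::tr[last()].
                return "".join(following[-1].itertext()).strip()
    return None

result = last_row_text_after(ET.fromstring(HTML), "Intel")
print(result)
```

Neither td nor p is named anywhere in the lookup, which is the point of using a relative descendant path in the predicate.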

correct way to scrape this table (using scrapy / xpath)

Given a table (unknown number of <tr> but always three <td>, and sometimes containing a strikethrough (<s>) of the first element which should be captured as additional item (with value 0 or 1))
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Where scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...], I currently try along those lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle this works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do it.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
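For the desired [[A1,A2,A3,0],[B1,B2,B3,1], ...] shape, one possible approach is to build each record row by row and append a flag for the strikethrough. The sketch below is not Scrapy code: it uses the standard library's ElementTree on a cut-down copy of the table so that it runs on its own, and the function name table_to_records is invented. With Scrapy selectors the same loop would presumably use row.xpath(...) and .getall() instead.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the table from the question (the "..." rows are omitted).
HTML = """
<table id="my_id">
<tr><td>A1</td><td>A2</td><td>A3</td></tr>
<tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
"""

def table_to_records(table):
    """One list per <tr>: the cell texts plus 1 if the first cell is struck through, else 0."""
    records = []
    for tr in table.findall("tr"):
        cells = tr.findall("td")
        # itertext() picks up text nested inside <s> as well as direct text.
        values = ["".join(td.itertext()).strip() for td in cells]
        struck = 1 if cells and cells[0].find("s") is not None else 0
        records.append(values + [struck])
    return records

records = table_to_records(ET.fromstring(HTML))
print(records)
```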

adding two iterative values into an update query asp-classic

I'm sure my error here is simple.
I have a 2D array from my SQL database, built from a single column:
myList = rs.GetRows()
I have an HTML page with one input per element of that array:
<input type="number" name="actual">
Now, what I'm trying to do is build a unique SQL update query where the actual column matches the unique_id column.
Let's assume there are only two values in my list.
for each x in myList
response.write(x)
1,
2
and there are only two inputs, as the inputs are generated by the unique ID:
for each inputs in request.form("actual")
response.write(inputs)
55,
66
Now I want to combine these to build my update query.
I've tried writing a double for loop, but this pairs every ID with every input, creating 4 pairs instead of 2:
Unique ID, Input
1 : 55
1 : 66
2 : 55
2 : 66
what I would like is
1 : 55
2 : 66
Is anyone able to help? I've been at this for hours. I'm not a coder or from a technical background, and I'm knee deep in legacy systems and processes.
I'm sure a dictionary would be the way to go so I can generate a one-to-one relationship, but I have no idea how to convert my inputs into a list and then pass them into a dict.
HTML code to generate my table:
<div class="container">
<table id="table" class="table">
<thead class="thead-dark">
<tr>
<th scope="col" data-field="article">Unique ID</th>
<th scope="col" data-field="item">Item Name</th>
<th scope="col" data-field="quant">Quantity</th>
<th scope="col" data-field="act">Actual</th>
</tr>
</thead>
<tbody>
<%
While not grs.eof
%>
<tr>
<td><%=grs.fields("UniqueID")%></td>
<td><%=grs.fields("itemName")%></td>
<td><%=grs.fields("quant")%></td>
<td><input type="number" class="form-control" id="actual" placeholder="<%=grs.fields("actual")%>" name="Actual"></td>
</tr>
<%
grs.movenext
Wend
' SQL update query goes here
%>
</tbody>
</table>
</div>
Ok, little thing here:
Your id can't be the same for every line, so do something like id=actual_<%=grs.fields("UniqueID")%>
You can try this:
<input type="number" class="form-control" id="actual_<%=grs.fields("UniqueID")%>" placeholder="<%=grs.fields("actual")%>" name="actual_<%=grs.fields("UniqueID")%>">
And then in your loop:
for each inputs in request.form
    if left(inputs, 7) = "actual_" then
        myId = mid(inputs, 8)
        myValue = request.form("actual_" & myId)
        <your sql statement here>
    end if
next
(You shouldn't need a separate length check on the input name: Left returns the whole string when the name is shorter than 7 characters, so the comparison simply fails for those fields.)
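The pairing logic itself is independent of ASP. As a language-neutral illustration, here is a small Python sketch of the same idea: because each field name carries its unique ID, every ID maps to exactly one value and no nested loop is needed. The form dictionary, the "submit" field, and the function name pair_ids_with_values are all hypothetical stand-ins for the posted form.

```python
PREFIX = "actual_"

def pair_ids_with_values(form):
    """Map each unique ID to its submitted value, using only 'actual_<id>' fields."""
    pairs = {}
    for name, value in form.items():
        if name.startswith(PREFIX):
            # Everything after the prefix is the unique ID.
            pairs[name[len(PREFIX):]] = value
    return pairs

# Hypothetical posted form, mirroring the example values from the question.
form = {"actual_1": "55", "actual_2": "66", "submit": "Save"}
pairs = pair_ids_with_values(form)
for unique_id, value in pairs.items():
    print(unique_id, ":", value)
```

Each (ID, value) pair would then feed one update statement, ideally through a parameterized query rather than string concatenation.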

XPath: returning the index of specific tag inside a set of tags with the same type

Here is an excerpt of my xml:
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
I know how to find a specific <tr> tag.
Is it possible to determine the index (ordinal number) of that <tr> tag inside the <tbody> tag? I guess it's possible to loop through the table, but the table is quite large and that would take a lot of time.
Is it possible to get this index/ordinal number with a single XPath statement?
I've used following XPath expression:
//tbody//td[text()='findMe']/../following-sibling::tr
This expression selects the 'tr' nodes that follow the row containing the 'findMe' text, so counting them gives the number of rows located after it. That is useful, because the quantity of 'tr' nodes can be obtained.
However, a check should be made before using this XPath, because if the 'findMe' string is absent the count is 0 as well. The following expression works fine as that validation:
//tbody//td[text()='findMe']
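In XPath 1.0 the ordinal itself can usually be obtained by counting the rows before the match, e.g. count(//tbody//td[text()='findMe']/../preceding-sibling::tr) + 1, although that still evaluates to 1 when nothing matches, so the validation step remains necessary. The Python sketch below shows the same lookup with the standard library; the cell values and the function name row_index are invented for illustration, since the original table cells are empty.

```python
import xml.etree.ElementTree as ET

# Made-up cell values; the table in the question has empty cells.
HTML = """
<table><tbody>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>findMe</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</tbody></table>
"""

def row_index(tbody, needle):
    """1-based position of the first <tr> with a <td> whose text equals needle, else None."""
    for position, tr in enumerate(tbody.findall("tr"), start=1):
        if any(td.text == needle for td in tr.findall("td")):
            return position
    return None

tbody = ET.fromstring(HTML).find("tbody")
print(row_index(tbody, "findMe"))
print(row_index(tbody, "missing"))
```

Returning None for a missing value folds the validation into the same pass over the rows.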

xpath expression to find url and data

I want to get the values of every table row, and the href value of every link within the table given below.
Being new to XPath, I am finding it difficult to write XPath expressions.
Understanding what an existing XPath expression does is somewhat easier.
the expected output
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class="dataTable">
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query which locates the wanted nodes and a second which outputs the text as you want it. Since you don't give more information about the languages you're using I'm putting together some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@HREF]")
    write node.select("concat(td/a/@HREF,' ',.)")
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@HREF | //tr//text()
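As a runnable counterpart to the pseudocode, here is a hedged Python sketch using only the standard library. The table as posted shows no <a> elements, so the placement of the links (wrapping the text of the first cell, with a lowercase href attribute) is an assumption made purely for this example, as is the function name rows_with_links.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the links are placed in the first cell, which the posted HTML does not show.
HTML = """
<table class="dataTable"><tbody>
<tr><td><a href="http://a.com/">data for a</a></td><td class="numericalColumn">526735</td><td class="numericalColumn">Z</td></tr>
<tr><td><a href="http://b.com/">data for b</a></td><td class="numericalColumn">522273</td><td class="numericalColumn">B</td></tr>
</tbody></table>
"""

def rows_with_links(table):
    """One 'href cell1 cell2 ...' string per row that contains a link."""
    lines = []
    for tr in table.iter("tr"):
        link = tr.find("td/a[@href]")
        if link is None:
            continue  # skip rows without a link, like tr[td/a/@href] would
        # Collapse whitespace inside each cell, then join the cells with spaces.
        texts = [" ".join("".join(td.itertext()).split()) for td in tr.findall("td")]
        lines.append(" ".join([link.get("href")] + texts))
    return lines

lines = rows_with_links(ET.fromstring(HTML))
for line in lines:
    print(line)
```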
