Combining XPath axes (preceding-sibling & following-sibling)

Say I have the following UL:
<ul>
<li>barry</li>
<li>bob</li>
<li>carl</li>
<li>dave</li>
<li>roger</li>
<li>steve</li>
</ul>
I need to grab all the LIs between bob & roger. I can grab everything after bob with //ul/li[contains(.,"bob")]/following-sibling::li, and I can grab everything before roger with //ul/li[contains(.,"roger")]/preceding-sibling::li. The problem is when I try to combine the two, I end up getting extra results.
For example, //ul/li[contains(.,"bob")]/following-sibling::li[contains(.,"roger")]/preceding-sibling::li will of course get everything before roger, instead of ignoring the items before bob.
Is there a way to chain these two xpaths together?

Try:
/ul/li[preceding-sibling::li='bob' and following-sibling::li='roger']
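To see why that single predicate is enough, here is a minimal sketch in Python using only the standard library. ElementTree does not support the sibling axes, so the helper between (a hypothetical name, not part of any library) reproduces the logic by hand: an li is kept only if some earlier sibling is "bob" and some later sibling is "roger".

```python
import xml.etree.ElementTree as ET

HTML = """
<ul>
<li>barry</li>
<li>bob</li>
<li>carl</li>
<li>dave</li>
<li>roger</li>
<li>steve</li>
</ul>
"""

def between(ul, start, end):
    """Mimic li[preceding-sibling::li=start and following-sibling::li=end]."""
    items = ul.findall("li")
    texts = [li.text for li in items]
    selected = []
    for i, li in enumerate(items):
        has_start_before = start in texts[:i]    # preceding-sibling::li = start
        has_end_after = end in texts[i + 1:]     # following-sibling::li = end
        if has_start_before and has_end_after:
            selected.append(li.text)
    return selected

ul = ET.fromstring(HTML)
result = between(ul, "bob", "roger")
print(result)  # ['carl', 'dave']
```

Both conditions are tested on the same li, which is what stops the items before bob from leaking back in.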

Related

how to select 2nd set of <ol> list with xpath

I'm trying to scrape the tips sections of these exercises, but many of the pages are structured differently, which results in a blank field.
The only thing they all have in common is that the tips are always in the 2nd ordered list. The 1st ordered list is the instructions; the 2nd ordered list is the tips.
Here are some of the xpath that I have tried:
//ol (this selects both ordered lists)
//ol[2] (this doesn't work at all for some reason)
//h3[contains(text(),'​Exercise Tips:')]/following::ol (some of the pages it didn't pick up tips section)
//DIV[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/DIV[1]/DIV[1]/OL[2] (again some of the pages it returned blank)
Links to some of the exercises where the pages differ:
https://www.muscleandstrength.com/exercises/one-arm-kettlebell-floor-press
https://www.muscleandstrength.com/exercises/overhead-tricep-extension.html
I would try //div[contains(@class, "exercise-tips")]//li/text().
Based on your example xpaths, you could use //div[contains(@class, "exercise-tips")]//ol if you truly want to select the ol element and not the text of the tips.
//ol[2] doesn't select any nodes because both ol nodes are the first ol child of their respective parents, not the second. (//ol)[2] does work, however. See https://stackoverflow.com/a/52166328/5225301 for more details.
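The difference between //ol[2] and (//ol)[2] can be sketched with Python's standard-library ElementTree, which supports the [position] predicate. The page layout below is a made-up stand-in (each ol in its own div), not the real markup of the linked pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout: each <ol> sits in its own <div>.
PAGE = """
<body>
  <div><ol><li>instruction</li></ol></div>
  <div><ol><li>tip</li></ol></div>
</body>
"""

root = ET.fromstring(PAGE)

# Like //ol[2]: an <ol> that is the 2nd <ol> child of its own parent.
per_parent = root.findall(".//ol[2]")

# Like (//ol)[2]: collect every <ol> in document order, then take the 2nd.
all_ols = root.findall(".//ol")
second = all_ols[1]

print(len(per_parent))         # 0
print(second.find("li").text)  # tip
```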

Xpath with contains() in importXML()

I've done this dozens of times, but now I don't get what I'm doing wrong. I want to extract specific records into 2 separate columns (I know that the order will not match), so I use:
//a/@href[contains(.; "github")]
and
//*[contains(text(); "Pricing:")]
But none of them is working. Where is my mistake?
(my sandbox: https://docs.google.com/spreadsheets/d/11Z3xybq_eYQvjn2-UBOomgeJxFrrsFoXKzF9yZSeASM/edit#gid=1841586203 with LT locale)
Damn, those Google Sheets locales! Inside the XPath string the separator must be a comma, not a semicolon, so it must be:
//a/@href[contains(., "github")]
and
//*[contains(text(), "Pricing:")]
I'll keep for further reference.

Trying to exclude a portion of an xPath

I have looked through several posts about this, but have failed to apply the principles used to get the result I desire, so I'm going to just post my specific problem.
I am building a Google Sheet that enables the user to pull up Bible verses.
I have it all working, however I am running into an issue with a hidden element being pulled into my text().
FUNCTION:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()")
RESULT: You shall put out both male and female, putting them outside the camp, that they may not defile their camp, 1in the midst of which I dwell."
You can see the "1" that is showing up before the word "in"
I have found the xPath that pulls only that "1"
//*[@class='scripture']//span[2]//sup//text()
I am trying to remove that "1" from the text.
HELP PLEASE!!! :)
You can add a predicate to the end to exclude text nodes that are inside sup elements:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]")
This will retrieve only the text nodes that are not inside a sup element, but it will still result in having the verse spread out across two cells, because there are two text nodes. You can rectify this by wrapping this expression in a JOIN():
=JOIN("", IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"))

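For readers who want to see what text()[not(ancestor::sup)] followed by a join amounts to, here is a rough Python sketch using only the standard library. The snippet is a made-up fragment modelled on the verse in the question, and text_outside is a hypothetical helper, not a library function.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the verse in the question.
SNIPPET = (
    '<span class="verse">that they may not defile their camp, '
    '<sup>1</sup>in the midst of which I dwell.</span>'
)

def text_outside(elem, skip_tag):
    """Collect text nodes under elem, skipping any inside a skip_tag element."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        if child.tag != skip_tag:
            parts.extend(text_outside(child, skip_tag))
        # The tail follows the child's closing tag, so it belongs to elem.
        if child.tail:
            parts.append(child.tail)
    return parts

span = ET.fromstring(SNIPPET)
pieces = text_outside(span, "sup")   # two separate text nodes, like two cells
verse = "".join(pieces)              # the JOIN("") step
print(pieces)
print(verse)
```

The two entries in pieces correspond to the two cells that IMPORTXML returns before JOIN() is applied.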
How to use substring() with Import.io?

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/@href
Though, I would like to just collect: 12345
substring might be the solution but it does not seem to work on import.io as I use it...
substring(./td[3]/a[1]/@href,13)
Any ideas of what the problem is?
Thanks a lot in advance!
Try using this for the xpath: (Have the field selected as Text)
.//*[@class='oeil']/a/@href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only supports functions in XPath when they return a node list.
Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/@href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring, but using substring-after() would be even better.
substring-after(/a/@href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input
substring-after(./td[3]/a[1]/@href,'=')
The leading ./ makes the path relative to the current context node, so ./td[3] only matches a td that is an immediate child of that node. I trust you know what you are doing.
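The off-by-one between 13 and 14 is easy to check outside import.io. The sketch below is plain Python, not XPath: xpath_substring is a hypothetical helper that mimics the two-argument substring() for whole-number start positions, and str.partition stands in for substring-after().

```python
href = "page.php?var=12345"

def xpath_substring(s, start):
    """Rough equivalent of XPath substring(s, start) for integer start >= 1.

    XPath counts characters from 1, Python slices count from 0.
    """
    return s[start - 1:]

from_13 = xpath_substring(href, 13)   # starts at the '=' sign
from_14 = xpath_substring(href, 14)   # starts just after it

# Like substring-after(href, '='): everything after the first '='.
after_equals = href.partition("=")[2]

print(from_13)       # =12345
print(from_14)       # 12345
print(after_equals)  # 12345
```

The substring-after() form does not depend on the length of the part before the equals sign, which is why it holds up better if the URL prefix ever changes.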

How to ignore first element in xpath

How can I ignore first element and get rest of the elements?
<ul>
<li>some link</li>
<li>some link 2</li>
<li>link i want to find</li>
</ul>
Thanks
If you want to ignore only the "first" element:
//li[position()>1]
or
(//a)[position()>1]
If you want only the last one (as your example seems to suggest):
//li[last()]
or
(//a)[last()]
You can use position() to skip over the "first" one, but depending on which element you are interested in and what the context is, you may need a slight variation on your XPATH.
For instance, if you wanted to address all of the li elements and get all except the first, you could use:
//li[position()>1]
and it would work as expected, returning all of the li elements except for the first.
However, if you wanted to address all of the a elements, you need to modify the XPath slightly. In the expression //a[position()>1], position() is counted among the a children of each parent, so when every a is the only a inside its li, each one has a position() of 1. The predicate position()>1 is then false for all of them and the expression selects nothing, while //a[last()] would select every a rather than only the last one.
You need to wrap the expression that selects the a elements in parentheses to group them into a single node-set, then apply the positional predicate to that set.
(//a)[position()>1]
Alternatively, you could also use an expression like this:
//a[preceding::a]
That will find all a elements except the first one (since there is no a preceding the first one).
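The per-parent versus grouped counting can be illustrated with a short Python sketch using the standard library. It assumes each li wraps a single a element (the question's markup shows only text inside the li, so the a tags here are an assumption), and it emulates the predicates with list slicing because ElementTree does not support position()>1.

```python
import xml.etree.ElementTree as ET

# Assumed markup: one <a> inside each <li>.
LIST = """
<ul>
<li><a>some link</a></li>
<li><a>some link 2</a></li>
<li><a>link i want to find</a></li>
</ul>
"""

root = ET.fromstring(LIST)

# Like //a[position()>1]: position is counted among the <a> children
# of each parent, so the first <a> of every parent is dropped.
per_parent = []
for parent in root.iter():
    per_parent.extend(a.text for a in parent.findall("a")[1:])

# Like (//a)[position()>1]: gather every <a> in document order first,
# then drop the first one of the whole group.
grouped = [a.text for a in root.findall(".//a")[1:]]

print(per_parent)  # []
print(grouped)     # ['some link 2', 'link i want to find']
```

With this markup, //a[preceding::a] gives the same result as the grouped form, because only the very first a has no a before it in the document.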

Resources