How to use substring() with Import.io? - xpath

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/@href
Though, I would like to just collect: 12345
substring might be the solution but it does not seem to work on import.io as I use it...
substring(./td[3]/a[1]/@href,13)
Any ideas of what the problem is?
Thanks a lot in advance!

Try using this for the xpath: (Have the field selected as Text)
.//*[@class='oeil']/a/@href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only supports XPath functions when they return a node list. substring() returns a string, which is why it does not work there.
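As a rough illustration of what that regex captures, here is a minimal Python sketch using the standard re module. It only mimics the regex step on the href string from the question; how import.io applies the pattern internally is not shown, and the variable names are made up for the example.

```python
import re

# href value taken from the question's HTML snippet
href = "page.php?var=12345"

# ([^=]*)$ captures the run of non-"=" characters that ends the string
match = re.search(r"([^=]*)$", href)
value = match.group(1)
print(value)
```

The pattern relies on the wanted value being the last thing after the final "=" in the href; a URL with further parameters would need a different pattern.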

Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/@href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring, but using substring-after() would be even better.
substring-after(/a/@href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input
substring-after(./td[3]/a[1]/@href,'=')
A leading . in a path expression selects only immediate child td nodes of the current context node. I trust you know what you are doing.
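To see why 14 rather than 13 is the right start position, and why substring-after() is less fragile, here is a small Python sketch that imitates the two XPath functions on the href from the question. The helper names xpath_substring and xpath_substring_after are hypothetical and exist only for this illustration; they cover integer start positions only, not every edge case of the XPath specification.

```python
def xpath_substring(text: str, start: int) -> str:
    """Imitate XPath substring(text, start): positions are 1-based."""
    return text[max(start, 1) - 1:]


def xpath_substring_after(text: str, marker: str) -> str:
    """Imitate XPath substring-after(): text after the first marker, "" if absent."""
    if marker == "":
        return text
    _, found, rest = text.partition(marker)
    return rest if found else ""


href = "page.php?var=12345"
from_13 = xpath_substring(href, 13)             # starts at the "=" character
from_14 = xpath_substring(href, 14)             # starts at the first digit
after_eq = xpath_substring_after(href, "=")     # independent of the prefix length
print(from_13, from_14, after_eq)
```

Because "=" is the 13th character of the href, counting from 1, the fixed-offset version breaks as soon as the page name changes length, while the substring-after() version does not.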

Related

What's wrong with this xpath statement?

Trying to get the color WHITE out of the line of code.
<a href="javascript:void(0)" class="itemAttr current" title="WHITE" data-value="WHITE"><img src="https://gloimg.rglcdn.com/rosegal/pdm-product-pic/Clothing/2019/06/05thumb-img/1559762268621192281.jpg"></a>
I've tried this:
color = driver.find_element_by_xpath("""//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]@title""").click()
I get this error message:
The string '//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]@title' is not a valid XPath expression.
What I want is to get "WHITE".
It looks like you are missing a / before the @title attribute. Try this xpath instead:
//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]/@title
In order to get an attribute value of an element, you need to put '/' before the '@title', so the following should work (provided the parent element p is correctly addressed):
//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]/@title
When working with XPath expressions, it is often useful to try them in one of the free online testers to get instant feedback.
Try using the XPath snippet below.
//p[@id='select-attr-0']//child::a[@data-value='WHITE']
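A common alternative to selecting the attribute node in the XPath itself is to locate the a element and then read the attribute from it; in Selenium that read would go through the element's get_attribute("title") rather than click(). The sketch below shows the same idea with Python's standard xml.etree.ElementTree so that it runs without a browser. The wrapping div and p elements and the shortened img src are assumptions made to get a well-formed snippet; they are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: the <p id="select-attr-0"> wrapper is inferred
# from the XPath in the question, and the <img> is closed so the snippet parses.
snippet = (
    '<div><p id="select-attr-0">'
    '<a href="javascript:void(0)" class="itemAttr current" '
    'title="WHITE" data-value="WHITE"><img src="thumb.jpg"/></a>'
    "</p></div>"
)
root = ET.fromstring(snippet)

# Locate the element first, then read the attribute from it.
link = root.find(".//p[@id='select-attr-0']/a[@href='javascript:void(0)']")
color = link.get("title")
print(color)
```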

How to extract items inside a table using scrapy

I want to extract all the functions listed in the table at the link below: python functions list
I have tried using the chrome developers console to get the exact xpath to be used in the file spider.py as below:
$x('//*[@id="built-in-functions"]/table[1]/tbody//a/@href')
but this returns a list of all the hrefs (which I think is what the XPath expression refers to).
I believe I need to extract the text here, but appending /text() to the above XPath returns nothing. Can someone please help me extract the function names from the table?
I think this should do the trick
response.css('.docutils .reference .pre::text').extract()
a non-exact xpath equivalent of it (but that also works in this case) would be:
response.xpath('//table[contains(@class, "docutils")]//*[contains(@class, "reference")]//*[contains(@class, "pre")]/text()').extract()
Try this:
for td in response.css("#built-in-functions > table:nth-child(4) td"):
    td.css("span.pre::text").extract_first()
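The reason /@href/text() returns nothing is that an attribute node has no child text nodes; the function name lives in the text of the elements inside the link, not in the href. The following sketch shows the difference with Python's standard xml.etree.ElementTree instead of Scrapy. The sample markup is a hypothetical, heavily simplified stand-in for the documentation table, so the class names and nesting are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the docs table (not the real page source).
sample = """
<div id="built-in-functions">
  <table class="docutils">
    <tbody>
      <tr>
        <td><a class="reference internal" href="#abs"><code><span class="pre">abs()</span></code></a></td>
        <td><a class="reference internal" href="#all"><code><span class="pre">all()</span></code></a></td>
      </tr>
    </tbody>
  </table>
</div>
"""
root = ET.fromstring(sample.strip())
links = root.findall(".//table//a")

# The attribute gives the link target...
hrefs = [a.get("href") for a in links]
# ...while the visible name is the text nested inside the link.
names = ["".join(a.itertext()).strip() for a in links]
print(hrefs, names)
```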

xpath query url with one folder depth only

I am using this XPath query successfully:
//div[(@class="result")]//a[contains(@href,"pinterest.com")]/@href
The URL I am running the XPath query against (with simple_html_dom.php) is this one here.
Now, I would like to find results for pinterest.com/one-folder-deep-only and exclude all URLs deeper than one directory, like pinterest.com/one-folder-deep-only/this or pinterest.com/one-folder-deep-only/this/this. I have no idea if there is a way to achieve that. Have googled a lot, but not found anything. Maybe my search terms weren't the best.
Do you have any ideas? Thanks for helping me here.
I am testing the query using the Chrome XPath Helper.
"//" is to evaluate all levels/depths. Instead use only one "/" for the "a" query to only evaluate immediate children
//div[(@id="first-result")]/a[contains(@href,"url.com")]/@href
Note use of / instead of // before the "a" tag.
Try the XPath below to select @href from the required anchors only:
//a[contains(@href, "url.com") and not(contains(substring-after(./@href, 'url.com/'), "/"))]/@href
Solution for XPath 2.0:
//a[contains(@href, "url.com") and count(tokenize(@href, "/"))=2]/@href
Note that if in real HTML source href starts-with "http://url.com" you should specify =4 instead of =2
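If the hrefs are filtered after extraction rather than inside the XPath, the same "one folder deep" rule can be sketched in Python with the standard urllib.parse module. This is only an illustration of the rule, not of simple_html_dom.php; the function name is_one_folder_deep is hypothetical, and the sketch assumes absolute URLs that include a scheme such as https://.

```python
from urllib.parse import urlparse


def is_one_folder_deep(url: str) -> bool:
    """True when the URL path has exactly one non-empty segment."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return len(segments) == 1


# Example URLs built from the paths mentioned in the question (scheme assumed).
urls = [
    "https://pinterest.com/one-folder-deep-only",
    "https://pinterest.com/one-folder-deep-only/this",
    "https://pinterest.com/one-folder-deep-only/this/this",
    "https://pinterest.com/",
]
kept = [u for u in urls if is_one_folder_deep(u)]
print(kept)
```

Counting path segments this way ignores a trailing slash, which the token-counting XPath would treat as an extra, empty token.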

What's the xpath syntax to get tag names?

I'm using Nokogiri to parse a large XML file. Say I've got the following structure:
<menagerie>
<penguin>Pablo</penguin>
<penguin>Mortimer</penguin>
<bull>Ferdinand</bull>
<aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
</menagerie>
I can count the non-penguins like this:
xml.xpath('//menagerie//*[not(self::penguin)]').length # => 2
But how do I get a list of the tags, like this? (The exact format isn't important; I just want to visually scan the non-penguins.)
bull
aardvark
Update
This gave me the list I wanted - thanks Oded and TMN and delnan!
xml.xpath('//menagerie/*[not(self::penguin)]').each do |node|
  puts node.name()
end
You can use the name() or local-name() XPath function.
See the examples on zvon.
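I know it's a bit outdated, but you should use xml.xpath('//menagerie/*[not(self::penguin)]/name()') as the expression. Note the slash, not the dot. This is how you call functions on the current node in XPath 2.0; XPath 1.0 engines, such as the libxml2 library that Nokogiri builds on, do not accept a function call as a path step.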
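In XPath 1.0, name() applied to a node-set returns only the name of its first node, so listing every tag name generally means iterating over the selected nodes in the host language, as the Ruby loop in the update does. A comparable sketch in Python with the standard xml.etree.ElementTree, using the XML from the question, could look like this:

```python
import xml.etree.ElementTree as ET

xml_text = """<menagerie>
  <penguin>Pablo</penguin>
  <penguin>Mortimer</penguin>
  <bull>Ferdinand</bull>
  <aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
</menagerie>"""
root = ET.fromstring(xml_text)

# Iterate over the direct children and keep the tag of every non-penguin.
non_penguins = [child.tag for child in root if child.tag != "penguin"]
for tag in non_penguins:
    print(tag)
```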

XPath Find full HTML element ID from partial ID

I am looking to write an XPath query to return the full element ID from a partial ID that I have constructed. Does anyone know how I could do this? From the following HTML (I have cut this down to remove work specific content) I am looking to extract f41_txtResponse from putting f41_txt into my query.
<input id="f41_txtResponse" class="GTTextField BGLQSTextField2 txtResponse" value="asdasdadfgasdfg" name="f41_txtResponse" title="" tabindex="21"/>
Cheers
You can use contains to select the element:
//*[contains(@id, 'f41_txt')]
Thanks to Thomas Jung I have been able to figure this out. If I use:
//*[contains(./@id, 'f41_txt')]/@id
This will return just the ID I am looking for.
I suggest not relying on the numbers in the ID when composing XPaths from a partial ID. Those numbers represent dynamic elements, and dynamic elements can change between deploys or releases of the system under test. The purpose is to identify elements uniquely.
Something like this may be a better option; you get the idea:
//input[contains(@id, '_txtResponse')]/@id
It worked for me as shown below:
//*[contains(./@id, 'f41_txt')]
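For comparison, the same partial-ID lookup can be sketched in Python. The standard xml.etree.ElementTree module supports only a subset of XPath without contains(), so this version does the substring test in Python instead. The wrapping form element is an assumption added to give the input element a parent; the input itself is copied from the question.

```python
import xml.etree.ElementTree as ET

# The <form> wrapper is assumed; the <input> is the element from the question.
snippet = (
    "<form>"
    '<input id="f41_txtResponse" class="GTTextField BGLQSTextField2 txtResponse" '
    'value="asdasdadfgasdfg" name="f41_txtResponse" title="" tabindex="21"/>'
    "</form>"
)
root = ET.fromstring(snippet)

# Collect the full id of every element whose id contains the partial value.
partial_id = "f41_txt"
matching_ids = [el.get("id") for el in root.iter() if partial_id in el.get("id", "")]
print(matching_ids)
```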

Resources