How to extract items inside a table using scrapy - xpath

I want to extract all the functions listed in the table on the page linked here: python functions list
I used the Chrome developer console to find the exact XPath to use in spider.py, as below:
$x('//*[@id="built-in-functions"]/table[1]/tbody//a/@href')
but this returns a list of all the hrefs (which I think is what the XPath expression refers to).
I believe I need to extract the text instead, but appending /text() to the above XPath returns nothing. Can someone please help me extract the function names from the table?

I think this should do the trick
response.css('.docutils .reference .pre::text').extract()
A non-exact XPath equivalent (which also works in this case) would be:
response.xpath('//table[contains(@class, "docutils")]//*[contains(@class, "reference")]//*[contains(@class, "pre")]/text()').extract()

Try this:
for td in response.css("#built-in-functions > table:nth-child(4) td"):
    td.css("span.pre::text").extract_first()
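As a rough illustration of why appending /text() to @href returns nothing: an attribute node has no text children, so the text has to be read from the a element (or the span inside it) instead. The sketch below uses only Python's standard-library xml.etree.ElementTree, which supports just a subset of XPath, on a small hand-written sample. The markup is an assumption modeled on the table in the question, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the table on the linked page.
SAMPLE = """
<div id="built-in-functions">
  <table class="docutils">
    <tbody>
      <tr>
        <td><a class="reference internal" href="#abs"><code><span class="pre">abs()</span></code></a></td>
        <td><a class="reference internal" href="#all"><code><span class="pre">all()</span></code></a></td>
      </tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# Select the anchor elements themselves rather than their href attributes.
links = root.findall(".//table/tbody//a")

# The attribute value is just a string; it has no text nodes beneath it.
hrefs = [a.get("href") for a in links]

# The visible function name is the text nested inside each anchor.
names = ["".join(a.itertext()).strip() for a in links]

print(hrefs)
print(names)
```

In Scrapy the same idea would mean pointing the expression at the a element's descendant text rather than at @href, as the answers above do.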

Related

xpath query url with one folder depth only

I am using this XPath query successfully:
//div[(@class="result")]//a[contains(@href,"pinterest.com")]/@href
The URL I am running the XPath query against (with simple_html_dom.php) is this one here.
Now, I would like to find results for pinterest.com/one-folder-deep-only and exclude all URLs deeper than one directory, like pinterest.com/one-folder-deep-only/this or pinterest.com/one-folder-deep-only/this/this. I have no idea if there is a way to achieve that. Have googled a lot, but not found anything. Maybe my search terms weren't the best.
Do you have any ideas? Thanks for helping me here.
I am testing the query using the Chrome XPath Helper.
"//" evaluates all levels/depths. Instead, use a single "/" for the "a" step so that only immediate children are evaluated:
//div[(@id="first-result")]/a[contains(@href,"url.com")]/@href
Note the use of / instead of // before the "a" tag.
Try the XPath below to select @href from the required anchors only:
//a[contains(@href, "url.com") and not(contains(substring-after(./@href, 'url.com/'), "/"))]/@href
Solution for XPath 2.0:
//a[contains(@href, "url.com") and count(tokenize(@href, "/"))=2]/@href
Note that if the href in the real HTML source starts with "http://url.com", you should specify =4 instead of =2.
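If the XPath engine at hand only supports XPath 1.0 (no tokenize), another option is to select every matching href with the original query and filter the strings afterwards. The following is a minimal sketch of that filter in Python. The function name is_one_folder_deep and the sample URLs are made up for illustration, and it assumes absolute URLs that include a scheme; the same idea should port to PHP.

```python
from urllib.parse import urlparse


def is_one_folder_deep(href, domain="pinterest.com"):
    """Return True if href is on `domain` and its path has exactly one segment."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    # Accept the bare domain and any subdomain such as www.
    if host != domain and not host.endswith("." + domain):
        return False
    # Ignore empty parts produced by leading or trailing slashes.
    segments = [part for part in parsed.path.split("/") if part]
    return len(segments) == 1


# Hypothetical hrefs, as they might come back from the XPath query.
hrefs = [
    "https://www.pinterest.com/one-folder-deep-only/",
    "https://www.pinterest.com/one-folder-deep-only/this",
    "https://www.pinterest.com/one-folder-deep-only/this/this",
    "https://www.pinterest.com/",
]

kept = [h for h in hrefs if is_one_folder_deep(h)]
print(kept)
```

Counting path segments after parsing avoids the need to adjust a token count depending on whether the href carries an http:// prefix or a trailing slash.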

Is it possible to use Following and preceding in combination in Selenium?

On this page
https://en.wikipedia.org/wiki/Trinity_Seven#Episode_list
I have:
//*[text()='Reception']//preceding::th[contains(@id, 'ep')]//following::i
But it only registers following.
The default FirePath selector is: .//*[@id='mw-content-text']/div/table[5]/tbody/tr/td[1]/i but this kind of selector is known to break quite frequently. I am wondering whether there is a better way of doing this, and I thought this might be one.
Thanks!
:)
- You can see that it's getting stuff under the table which is not what I want :S
Try the XPath below to match the required elements:
//th[contains(@id, 'ep')]/following::i[./following::*[text()='Reception']]
This looks simpler:
//tr[contains(@class, 'vevent')]//i
Don't overcomplicate things. You need the i tag inside each row, so just find the row locator tr[contains(@class, 'vevent')] and get its i.
Another good approach, when you want to check that a parent element contains some specific element but actually need a third element, is the pattern //element[./specific]//child. In your case:
//tr[contains(@class, 'vevent')][./th[contains(@id,'ep')]]//i
That is, the i tag inside a row whose header has an id containing 'ep'.
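To make the //element[./specific]//child pattern concrete, here is a small sketch that applies the same logic step by step in plain Python with the standard-library xml.etree.ElementTree (which has no contains(), hence the manual checks). The sample markup is invented for illustration and only loosely resembles the episode table.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two episode rows, then a Reception section
# containing an i element that should NOT be matched.
SAMPLE = """
<div>
  <table>
    <tr class="vevent"><th id="ep1">1</th><td><i>First Title</i></td></tr>
    <tr class="vevent"><th id="ep2">2</th><td><i>Second Title</i></td></tr>
  </table>
  <h2>Reception</h2>
  <p><i>Unrelated Work</i></p>
</div>
"""

root = ET.fromstring(SAMPLE)

titles = []
for tr in root.iter("tr"):
    # tr[contains(@class, 'vevent')]
    has_class = "vevent" in tr.get("class", "")
    # [./th[contains(@id, 'ep')]]
    has_ep_header = any("ep" in th.get("id", "") for th in tr.findall("th"))
    if has_class and has_ep_header:
        # //i : every i element anywhere inside the matching row
        titles.extend(i.text for i in tr.iter("i"))

print(titles)
```

Because the search is anchored on the row, nothing after the table can leak into the result, which is the problem the following:: axis caused in the original attempt.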

How to use substring() with Import.io?

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/@href
However, I would like to collect just: 12345
substring might be the solution, but it does not seem to work on import.io the way I use it...
substring(./td[3]/a[1]/@href,13)
Any idea what the problem is?
Thanks a lot in advance!
Try using this for the XPath (with the field type set to Text):
.//*[@class='oeil']/a/@href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only supports XPath functions when they return a node list.
Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/@href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring(), but substring-after() would be even better.
substring-after(/a/@href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input:
substring-after(./td[3]/a[1]/@href,'=')
The leading ./ makes the path relative, so it selects only td elements that are immediate children of the current context node. I trust you know what you are doing.
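The difference between offsets 13 and 14 comes from XPath's substring() being 1-based: in page.php?var=12345 the "=" is the 13th character, so the digits start at position 14. The short Python sketch below mirrors the three approaches on the href from the question; Python slices are 0-based, so the offsets shift by one. The variable names are illustrative only.

```python
from urllib.parse import urlparse, parse_qs

href = "page.php?var=12345"

# XPath substring(href, 14) is 1-based, so it matches the 0-based slice href[13:].
by_position = href[13:]

# substring(href, 13) starts one character earlier, at the "=" sign.
off_by_one = href[12:]

# substring-after(href, '=') keeps everything after the first "=".
after_equals = href.partition("=")[2]

# Parsing the query string does not depend on character positions at all.
by_name = parse_qs(urlparse(href).query)["var"][0]

print(by_position, off_by_one, after_equals, by_name)
```

The position-based variants break as soon as the file name or parameter name changes length, which is why substring-after() (or a regex applied after extraction, as suggested above) is usually the sturdier choice.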

Handling Dynamic Xpath

I am automating things using Selenium and need help handling a dynamic XPath such as the one below:
Driver.findElement(By.xpath("//*[@id='INQ_2985']/div[2]/tr/td/div/div[3]/div")).click();
In the above, the number in INQ_2985 changes to 2986, 2987, 2988, etc. on each run.
HTML CODE:
<div class="context-menu-item-inner" style="background-image:url(../images/productSmall.png);">Tender Assignment</div>
I tried different combinations, as below, but with no success:
// Driver.findElement(By.name("//input[@name='Tender Assignment']")).click();
// Driver.findElement(By.className("context-menu-item-inner")).click();
Can you help me with this?
You can try using contains() or starts-with() in the XPath.
The above XPath can be rewritten as follows:
Driver.findElement(By.xpath("//*[starts-with(@id,'INQ')]/div[2]/tr/td/div/div[3]/div")).click();
If you can post more of your HTML, we can help improve your XPath.
Moreover, using such long XPaths is not recommended, as it may cause your test to fail more often.
For example, if a new table cell or div is added to the UI, the above XPath will no longer be valid.
You should try to use id, class, or other attributes to get closer to the element you're trying to find.
I personally recommend using CSS selectors over XPath.
You can use several methods.
Use an implicit wait.
driver.findElement(By.xpath("//*[contains(@id,'select2-result-label-535')]")).click();
driver.findElement(By.xpath("//*[contains(text(), 'select2-result-label-535')]")).click();
A partial match with contains() also works well:
driver.findElement(By.xpath("//*[contains(@id,'INQ_')]"));
Note: If only a single ID starts with INQ_, you can act on that element directly. If there are several such IDs, you can collect them as a List<WebElement> and compare each element's text (element.getText().trim().equals("Linked Text")), then act on the one that matches. You can follow other logic to traverse and match.
You can use CSS:
div.context-menu-item-inner
Use this CSS selector:
driver.findElement(By.cssSelector("div.context-menu-item-inner")).click();
The best choice is to use the full XPath instead of the id; you can get it easily via Firebug.
e.g.
/html/body/div[3]/div[3]/div[2]/div/div[2]/div[1]/div/div[1]
If your XPath varies,
e.g. "//*[@id='msg500']", "//*[@id='msg501']", "//*[@id='msg502']" and so on,
then use this code in your script:
for (int i = 0; i <= 9; i++) {
    String mpath = "//*[@id='msg50" + i + "']";
    driver.findElement(By.xpath(mpath)).click();
}
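The answers above share one idea: match on the stable part of the id (the INQ_ prefix) and then narrow down by class or text, instead of hard-coding the changing number. As a language-neutral sketch of that matching logic, the following Python snippet walks a small invented document with the standard-library xml.etree.ElementTree. It is not Selenium code, and the sample markup, including the footer div, is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the numeric suffix of the id changes per run.
SAMPLE = """
<body>
  <div id="INQ_2987">
    <div class="context-menu-item-inner">Tender Assignment</div>
    <div class="context-menu-item-inner">Other Action</div>
  </div>
  <div id="footer">
    <div class="context-menu-item-inner">Tender Assignment</div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Equivalent in spirit to //*[starts-with(@id, 'INQ_')]
containers = [el for el in root.iter() if el.get("id", "").startswith("INQ_")]

# Within those containers, keep divs with the expected class and visible text.
matches = [
    div
    for container in containers
    for div in container.iter("div")
    if div.get("class") == "context-menu-item-inner"
    and (div.text or "").strip() == "Tender Assignment"
]

print([(c.get("id")) for c in containers], len(matches))
```

Scoping the class and text check to the INQ_ container is what keeps the identically labelled item in the footer from being picked up.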

Xpath Multiple Predicates

I am trying to quickly find a specific node using XPath but it seems my multiple predicates are not working. The div I need has a specific class, but there are 3 others that have it. I want to select the fourth one so I did the following:
//div[@class='myCLass' and 4]
However the "4" is being ignored. Any help? I am new to XPath.
Thanks.
If an XPath query returns a node set, you can always use the [OFFSET] operator to access a certain element of it.
Use the following query to access the fourth element that matches the @class='myCLass' predicate:
//div[@class='myCLass'][4]
@WilliamNarmontas's answer might be an alternative to the syntax shown above.
Alternatively,
//div[@class='myCLass' and position()=4]
The accepted answer works correctly only if all of the div elements have the same parent. Otherwise use:
(//div[@class='myCLass'])[4]
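Two details may help here. First, the "4" in the original attempt is ignored because the operands of "and" are converted to booleans, and any non-zero number converts to true. Second, //div[...][4] counts matches per parent, while (//div[...])[4] counts across the whole document. The sketch below emulates both in plain Python on an invented sample where the four matching divs are split across two parents; it uses the standard-library xml.etree.ElementTree only to parse and walk the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: four matching divs, two under each parent.
SAMPLE = """
<root>
  <section>
    <div class="myCLass">a</div>
    <div class="myCLass">b</div>
  </section>
  <section>
    <div class="myCLass">c</div>
    <div class="myCLass">d</div>
  </section>
</root>
"""

root = ET.fromstring(SAMPLE)


def matches(el):
    """Mirror of the predicate div[@class='myCLass']."""
    return el.tag == "div" and el.get("class") == "myCLass"


# //div[@class='myCLass'][4]: for each parent, the 4th matching child (if any).
per_parent = []
for parent in root.iter():
    kids = [child for child in parent if matches(child)]
    if len(kids) >= 4:
        per_parent.append(kids[3].text)

# (//div[@class='myCLass'])[4]: the 4th match in document order.
in_document_order = [el for el in root.iter("div") if matches(el)]
fourth = in_document_order[3].text

print(per_parent, fourth)
```

With this layout the per-parent form finds nothing, because no single parent holds four matching divs, while the parenthesized form returns the div containing "d".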
