How to extract links with xpath expression - xpath

This is the HTML markup I want to extract links from:
<div id="target">
<a href="https://www.chotosite.com">chotosite</a>
<a href="https://www.bit2lead.com">bit2lead</a>
</div>
This is my XPath expression and Scrapy code:
links = response.xpath("//div[@id='target']/a/@href")
for link in links:
    print(link)
What I expect to find is
https://www.chotosite.com
https://www.bit2lead.com
But this is what I get on the console:
<Selector xpath="//div[@id='target']/a/@href" data='https://www.chotosite.com'>
<Selector xpath="//div[@id='target']/a/@href" data='https://www.bit2lead.com'>
How do I solve this problem?

Hope this helps.
In [18]: selector = scrapy.Selector(text="""<div id="target">
...: <a href="https://www.chotosite.com">chotosite</a>
...: <a href="https://www.bit2lead.com">bit2lead</a>
...: </div>""")
In [20]: [link.xpath('@href').extract_first() for link in selector.xpath('//div/a')]
Out[20]: ['https://www.chotosite.com', 'https://www.bit2lead.com']

To get the result you want, call .extract() on the selection:
links = response.xpath("//div[@id='target']/a/@href").extract()
.extract() returns a list of strings instead of Selector objects.
Use response.xpath, not xpath alone.
Now you can loop through your links:
for link in links:
    print(link)

Related

extract Xpath for string in a div class

I have the HTML below:
<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>
I would like to extract "GGRM.JK" from the HTML.
//div[contains(@class, "symbol")]
returns the element, not the text "GGRM.JK".
Since it seems you are using Python, try the following:
import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)
# version 1: read the class attribute and split on the single quotes
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])
# version 2: read the link's href and take the part after "="
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])
In either case, the output should be
GGRM.JK
The shortest way to get the substring you want with XPath only, without post-processing, is to use the functions substring-after and substring-before.
Here is an example of how to get 'GGRM.JK' from both the class and href attributes.
import lxml.html as lh
htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""
htmlDom = lh.fromstring(htmlText)
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)

XPath problem with multiple OR expressions like (a|b|c) [duplicate]

I have this simplified HTML:
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
I want to find only one and two, on the condition that the parent tag is main or support and the child is a span or div.
I wonder why this code does not work:
import lxml.html as HTML_PARSER
html = """
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
"""
parent = '//main | //support'
child = '/span | /div'
doc = HTML_PARSER.fromstring(html)
print(doc)
xpath = '(%s)(%s)' % (parent, child)
print(xpath)
parsed = doc.xpath(xpath)
print(parsed)
I get an "Invalid expression" error. Why?
The XPaths (//main | //support) and (/span | /div) are each valid on their own.
A simple combination like (//main | //support)/span is also valid.
So why is the more complicated combination (//main | //support)(/span | /div) invalid, and how can I fix it?
In my real case //main, //support, /span and /div are much more complicated XPaths, so I want a general solution of the form (xpath1 | xpath2)(xpath3 | xpath4).
This will find it, although I'm not 100% sure it's what you want:
//*[name() = 'main' or name() = 'support']/*[name() = 'span' or name() = 'div']/text()
Your XPath is not valid in XPath version 1.0 (the version that lxml uses).
Try
xpath = '//div[parent::support]|//span[parent::main]'
or
parent = ['main', 'support']
child = ['span', 'div']
xpath = '//*[self::{0[0]} or self::{0[1]}]/*[self::{1[0]} or self::{1[1]}]'.format(parent, child)
You can use the self:: axis:
(//main | //support)[*[self::div or self::span]]
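For the general (xpath1 | xpath2)(xpath3 | xpath4) case, one possible approach is to expand the product by hand into a single union that XPath 1.0 accepts. The sketch below only builds the expression string; combine_xpaths is a hypothetical helper name, and it assumes each child step starts with "/" and contains no top-level "|" of its own.

```python
from itertools import product


def combine_xpaths(parents, children):
    """Expand (p1 | p2)(c1 | c2) into (p1)c1 | (p1)c2 | (p2)c1 | (p2)c2.

    Each parent is wrapped in parentheses so that a parent expression which
    itself contains a union still binds correctly to the child step.
    """
    return " | ".join("(%s)%s" % (p, c) for p, c in product(parents, children))


xpath = combine_xpaths(["//main", "//support"], ["/span", "/div"])
print(xpath)
```

The resulting string can then be passed to doc.xpath(xpath) as in the question. The expression grows with the number of parent and child alternatives, so for simple tag names the self:: axis shown above stays shorter.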

Xpath for //a class

I am trying to find an XPath for the "Template 1" link. So far I have tried the following, but nothing works:
//a[@class='treenode' and contains(text(),'Template 1')]
//div[@id='objTree~templates~750764_children' and text()='Template 1']
//a[contains(@href, 'javascript:void(0)')]/text()
//a[contains(text(), 'Template 1')]
//a[@class='treenode' and starts-with(@href, '/javascript/')]
//a[text()="Template 1"]
//a[normalize-space(.) = 'Template 1']
Here is my HTML
<a class="treenode" href="javascript:void(0)" onclick="; highlightNode(this);" oncontextmenu="return showContext(this);">Template 1</a>
Ruby code:
link(:my_template1, :xpath=> "//a[@class='treenode' and contains(text(),'Template 1')]")
What am I doing wrong?
Your XPath expressions seem to be OK, but the link might be generated dynamically by JavaScript, in which case you get NoSuchElementException because it is not present in the initial DOM. You can try waiting until the link appears in the DOM.
I'm not sure about the Ruby syntax. Try this and let me know if it doesn't work:
wait = Selenium::WebDriver::Wait.new(:timeout => 15)
element = wait.until { driver.find_element(:link_text => "Template 1") }
element.click

XPath - extracting text between two nodes

I'm encountering a problem with my XPath query. I have to parse a div which is divided into an unknown number of "sections". Each of these is separated by an h5 with a section name. The list of possible section titles is known and each of them can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected result (for SecondHeader)
['text2a', 'text2b']
Query #1
//text()[following-sibling::h5/text()='ThirdHeader']
Result #1
['text1', 'text2a', 'text2b']
That's obviously a bit too much, so I decided to restrict the result to the content between the selected header and the header before it.
Query #2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result #2
['text2a', 'text2b']
The results match what I expect. However, this can't be used, because I don't know whether SecondHeader/ThirdHeader will exist in the parsed page. The query has to use only one section title.
Query #3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result #3
[]
Could you please tell me what I am doing wrong? I've tested it in Google Chrome.
If all h5 elements and text nodes are siblings, and you need to group by section, a possible option is simply to select text nodes by count of h5 that come before.
Example using lxml (in Python)
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>
You should be able to just test the first preceding sibling h5...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
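As a rough check of the "first preceding sibling h5" idea, the sketch below runs that query through lxml with the section title passed as an XPath variable, then strips whitespace and drops empty text nodes. The section_text helper is a hypothetical name introduced here; the markup is the sample from the question.

```python
import lxml.html

html = """
<div class="some-class">
  <h5>FirstHeader</h5>
  text1
  <h5>SecondHeader</h5>
  text2a<br>
  text2b
  <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
  <h5>FourthHeader</h5>
  text4
</div>"""


def section_text(doc, title):
    """Return the non-empty text lines whose nearest preceding h5 is `title`."""
    # h5[1] on the preceding-sibling axis is the closest h5 before the text node.
    query = "//text()[preceding-sibling::h5[1][normalize-space()=$title]]"
    nodes = doc.xpath(query, title=title)
    return [node.strip() for node in nodes if node.strip()]


doc = lxml.html.fromstring(html)
second = section_text(doc, "SecondHeader")
third = section_text(doc, "ThirdHeader")
print(second)
print(third)
```

Only one section title appears in the query, so it does not depend on whether the next header exists in the parsed page.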

WebDriver Capture Text by XPath

I am attempting to capture a line of text in an automated WebDriver test so that I can use it in a comparison later on. However, I cannot find an XPath that will work with WebDriver. I have used the text() function before to capture text that is not in a tag, but in this instance that is not working. Here is the HTML. Note that this text will never be the same, so I cannot use contains or similar functions.
<div id="content" class="center ui-content" data-role="content" role="main">
<div data-iscroll="scroller">
<div class="ui-corner-all ui-controlgroup ui-controlgroup-vertical" data-role="controlgroup">
<a class="ui-btn ui-corner-top ui-btn-hover-c" style="text-align: left" data-role="button" onclick="onDocumentClicked(21228772, "document.php?loan=********&folderseq=0&itemnum=21228772&pageCount=3&imageTypeName=1003 Application - Final&firstInitial=&lastName=")" href="#" data-corners="true" data-shadow="true" data-iconshadow="true" data-wrapperels="span" data-theme="c">
<span class="ui-btn-inner ui-corner-top">
<span class="ui-btn-text">
<img class="checkMark checkMark21228772 notViewedCompletely" width="15" height="15" title="You have not yet viewed this document." src="../images/white_dot.gif"/>
1003 Application - Final. (Jan 11 2012 5:04PM)
</span>
</span>
</a>
In this example, the text I am attempting to capture is: 1003 Application - Final. (Jan 11 2012 5:04PM)
I have inspected the element with Firebug and I have tried the following XPaths with no success.
html/body/div[1]/div[2]/div/div/a[1]/span/span
html/body/div[1]/div[2]/div/div/a[1]/span/span/text()
The WebDriver test is being written in C#.
You can either use this
driver.FindElement(By.XPath("//div[@id='content']//span[@class='ui-btn-text']"))
or
var elem = driver.FindElement(By.Id("content"));
string text = string.Empty;
if (elem != null) {
    // the span is a descendant of the content div, not a sibling
    var textElem = elem.FindElement(By.XPath(".//span[@class='ui-btn-text']"));
    if (textElem != null) text = textElem.Text;
}
I was able to solve this issue by removing the span tags from the XPath.
GetText("html/body/div[3]/div[2]/div/div/a[1]", SelectorType.XPath);
Python WebDriver code looks something like this:
driver.find_element_by_xpath("//span[@class='ui-btn-text']").text
But the locator may not be unique; I can't tell, because I can't see the whole page.
PS: Try never to use absolute locators like html/body/div[1]/div[2]/div/div/a[1]/span/span
Approach:
Find a CSS selector from the given DOM.
Derived CSS: css=#content div.ui-controlgroup > a[onclick*='onDocumentClicked'] > span > span
Use the C# library method to get the text.

Resources