There are multiple ways to find it, but I want to do this in a specific manner. Here it is:
To get an element with some text in it, my framework creates an xpath in this manner:
#xpath = "//h1[contains(text(), '[the-text-i-am-searching-for]')]"
Then it executes:
find(:xpath, @xpath).visible?
Now, in a similar format, I want to create an xpath which just looks for a text anywhere in the page, and which can then be used in find(:xpath, @xpath).visible? to return true or false.
To give a little more context:
My HTML paragraph looks something like this:
<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>
but if I try to find it using find(:xpath, @xpath) where my xpath is
@xpath = "//p[contains(text(), '[the-text-i-am-searching-for]')]"
it fails.
Try replacing "//p[contains(text(), '[the-text-i-am-searching-for]')]" with "//p[contains(., '[the-text-i-am-searching-for]')]"
I don't know your environment, but in Python with lxml it works:
>>> import lxml.etree
>>> doc = lxml.etree.HTML("""<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>""")
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
>>> doc.xpath('//p[contains(., "[the-text-i-am-searching-for]")]')
[<Element p at 0x1c1b9b0>]
>>>
The context node . will be converted to its string value to match the signature boolean contains(string, string) (http://www.w3.org/TR/xpath/#section-String-Functions):
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>>
Consider these variations:
>>> doc.xpath('//p')
[<Element p at 0x1c1b9b0>]
>>> doc.xpath('//p/*')
[<Element b at 0x1e34b90>, <Element a at 0x1e34af0>]
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>> doc.xpath('//p/text()')
['some text here ', ' again some text ', ' [the-text-i-am-searching-for]']
>>> doc.xpath('string(//p/text())')
'some text here '
>>> doc.xpath('//p/text()[3]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p/text()[contains(., "[the-text-i-am-searching-for]")]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
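Finally, for the page-wide lookup the question asks about, the same trick should work with //* instead of //p. A minimal hedged sketch, continuing the lxml session above (note that //* also matches ancestor elements like html and body, so it is mainly useful as a boolean check):
>>> bool(doc.xpath('//*[contains(., "[the-text-i-am-searching-for]")]'))
True
>>> bool(doc.xpath('//*[contains(., "text that is not on the page")]'))
False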
I'm using a European Space Agency API to query for satellite image metadata (the result can be viewed here) to parse into Python objects.
Using the requests library I can successfully get the result in XML format and then read the content with lxml. I am able to find the elements and explore the tree as expected:
# load the response into an element tree (fromstring returns the root element)
root = etree.fromstring(response.content)
ns = root.nsmap
# get the first entry element and its summary
e = root.find('entry',ns)
summary = e.find('summary',ns).text
print summary
>> 'Date: 2018-11-28T09:10:56.879Z, Instrument: OLCI, Mode: , Satellite: Sentinel-3, Size: 713.99 MB'
The entry element has several date descendants with different values of the attribute name:
for d in e.findall('date', ns):
    print d.tag, d.attrib
>> {http://www.w3.org/2005/Atom}date {'name': 'creationdate'}
{http://www.w3.org/2005/Atom}date {'name': 'beginposition'}
{http://www.w3.org/2005/Atom}date {'name': 'endposition'}
{http://www.w3.org/2005/Atom}date {'name': 'ingestiondate'}
I want to grab the beginposition date element using XPath syntax ([@attrib='value']) but it just returns None. Even just searching for a date element with the name attribute ([@attrib]) returns None:
dt_begin = e.find('date[@name="beginposition"]', ns)  # dt_begin is None
dt_begin = e.find('date[@name]', ns)  # dt_begin is None
The entry element includes other children that exhibit the same behaviour, e.g. multiple str elements, also with differing name attributes.
Has anyone encountered anything similar, or is there something I'm missing? I'm using Python 2.7.14 with lxml 4.2.4.
It looks like an explicit prefix is needed when a predicate ([@name="beginposition"]) is used. Here is a test program:
from lxml import etree
print etree.LXML_VERSION
tree = etree.parse("data.xml")
ns1 = tree.getroot().nsmap
print ns1
print tree.find('entry', ns1)
print tree.find('entry/date', ns1)
print tree.find('entry/date[@name="beginposition"]', ns1)
ns2 = {"atom": 'http://www.w3.org/2005/Atom'}
print tree.find('atom:entry', ns2)
print tree.find('atom:entry/atom:date', ns2)
print tree.find('atom:entry/atom:date[@name="beginposition"]', ns2)
Output:
(4, 2, 5, 0)
{None: 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750b90>
<Element {http://www.w3.org/2005/Atom}date at 0x7f89877503f8>
None
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750098>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a950>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a7a0>
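The same applies to lxml's full xpath() method, which has no default-namespace support either; a minimal hedged sketch (the atom prefix name is our own arbitrary choice, and data.xml is the same file as above):
from lxml import etree
tree = etree.parse("data.xml")
ns = {"atom": "http://www.w3.org/2005/Atom"}  # prefix name chosen by us
# unlike find(), xpath() takes a namespaces= argument; @name works the same
print tree.xpath('//atom:entry/atom:date[@name="beginposition"]/text()', namespaces=ns)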
I'm looking for a table which contains ASCII characters and same-looking UTF-8 characters. I know it also depends on the font whether they look the same, but something generic to start with is enough.
>>> # PY3 code:
>>> a='H' # ascii
>>> b='Н' # Cyrillic En (U+041D)
>>> a==b
False
>>> ' '.join(format(ord(x), 'b') for x in a)
'1001000'
>>> ' '.join(format(ord(x), 'b') for x in b)
'10000011101'
>>> a='P' # ascii
>>> b='Ρ' # Greek Rho (U+03A1)
>>> a==b
False
>>> ' '.join(format(ord(x), 'b') for x in a)
'1010000'
>>> ' '.join(format(ord(x), 'b') for x in b)
'1110100001'
This is a very useful tool, as it will show you all characters which look similar, and you can choose whether they are REALLY similar enough for you :)
https://unicode.org/cldr/utility/confusables.jsp?a=test&r=None
Some other resources:
This is called Visual Spoofing
Python Package to detect confusables
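To check why such a pair compares unequal without any external tool, the stdlib unicodedata module shows the code points and official character names; a small PY3 sketch matching the examples above:
>>> import unicodedata
>>> for c in 'HНPΡ':
...     print(f'U+{ord(c):04X} {unicodedata.name(c)}')
...
U+0048 LATIN CAPITAL LETTER H
U+041D CYRILLIC CAPITAL LETTER EN
U+0050 LATIN CAPITAL LETTER P
U+03A1 GREEK CAPITAL LETTER RHO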
I would like to select the following text:
Bold normal Italist
I need to select it and get: Bold normal Italist.
The html is:
<a><strong>Bold</strong> normal <i>Italist</i></a>
However, a/text() yields
normal
only. Does anyone know a fix? I'm testing Bing crawling, and the bold text is in a different position depending on the query.
You can use a//text() instead of a/text() to get all text nodes.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
doc = """
<a><strong>Bold</strong> normal <i>Italist</i></a>
"""
sel = Selector(text=doc, type="html")
result = sel.xpath('//a/text()').extract()
print result
# >>> [u' normal ']
result = u''.join(sel.xpath('//a//text()').extract())
print result
# >>> Bold normal Italist
You can try to use
a/string()
or
normalize-space(a)
which returns Bold normal Italist
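Note that a/string() is XPath 2.0 syntax; Scrapy's selectors are XPath 1.0, where the equivalent spellings are string(a) and normalize-space(a). A minimal sketch of the latter against the same document:
from scrapy.selector import Selector
doc = "<a><strong>Bold</strong> normal <i>Italist</i></a>"
sel = Selector(text=doc, type="html")
print sel.xpath('normalize-space(//a)').extract()
# >>> [u'Bold normal Italist']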
I have some HTML that looks like the screenshot. I want to get the table rows. I have:
for table_row in response.selector.xpath("//*[@id = 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties']"):
    print table_row
In the command line I tried:
>>> table_row
Out[5]: <Selector xpath="//*[@id = 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties']" data=u'<table class="ParamText" cellspacing="0"'>
>>> table_row.xpath('/tbody')
Out[6]: []
>>> table_row.xpath('//tbody')
Out[7]: []
Why am I unable to select the tbody?
tbody is generated by the browser; you don't get it from the Scrapy downloader. Just go straight to the tr elements:
table_row.xpath('.//tr')
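A minimal sketch of the difference, assuming raw HTML without tbody (which is what Scrapy downloads): //tbody searches the whole document and finds nothing, while .//tr is relative to the selected table:
from scrapy.selector import Selector
html = "<table class='ParamText'><tr><td>a</td></tr><tr><td>b</td></tr></table>"
table = Selector(text=html, type="html").xpath('//table')[0]
print table.xpath('//tbody').extract()  # [] - no tbody in the raw markup
print len(table.xpath('.//tr'))         # 2  - rows found relative to table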
I'm encountering a problem with my XPath query. I have to parse a div which is divided into an unknown number of "sections". Each of these is separated by an h5 with a section name. The list of possible section titles is known, and each of them can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected result (for SecondSection)
['text2a', 'text2b']
Query #1
//text()[following-sibling::h5/text()='ThirdHeader']
Result #1
['text1', 'text2a', 'text2b']
It's obviously a bit too much, so I've decided to restrict the result to the content between the selected header and the header before it.
Query #2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result #2
['text2a', 'text2b']
The yielded results meet the expectations. However, this can't be used: I don't know whether SecondHeader/ThirdHeader will exist in the parsed page or not. The query needs to use only one section title.
Query #3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result #3
[]
Could you please tell me what I am doing wrong? I've tested it in Google Chrome.
If all h5 elements and text nodes are siblings, and you need to group by section, a possible option is simply to select text nodes by the count of h5 elements that come before them.
Example using lxml (in Python):
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>
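If only the section title is known, as in the question, its 1-based position can be looked up first and passed as the count; a hedged continuation of the same session (SecondHeader is the question's example):
>>> headers = doc.xpath('//div[@class="some-class"]/h5/text()')
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=headers.index('SecondHeader') + 1)
['\n text2a', '\n text2b\n ']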
You should be able to just test the first preceding sibling h5...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
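A quick hedged check of this expression, reusing s and lxml.html from the previous answer's session:
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]")
['\n text2a', '\n text2b\n ']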