XPath - extracting text between two nodes - xpath

I'm encountering a problem with my XPath query. I have to parse a div that is divided into an unknown number of "sections". Each section is introduced by an h5 containing the section name. The list of possible section titles is known, and each title can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected result (for SecondHeader)
['text2a', 'text2b']
Query #1
//text()[following-sibling::h5/text()='ThirdHeader']
Result #1
['text1', 'text2a', 'text2b']
That's obviously a bit too much, so I decided to restrict the result to the content between the selected header and the header before it.
Query #2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result #2
['text2a', 'text2b']
The results meet expectations. However, this can't be used: I don't know whether SecondHeader/ThirdHeader will exist in the parsed page. The query must use only one section title.
Query #3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result #3
[]
Could you please tell me what I am doing wrong? I've tested it in Google Chrome.

If all h5 elements and text nodes are siblings and you need to group by section, one option is simply to select text nodes by the number of h5 elements that precede them.
Example using lxml (in Python)
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>

You should be able to just test the first preceding sibling h5...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
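As for what went wrong in Query #3: not[...] is not a function call. XPath reads it as a step selecting child elements named "not", finds none, and the predicate is therefore always false; the function form is not(...). Below is a minimal lxml sketch of the nearest-preceding-header query. The helper name section_text and the whitespace stripping are illustrative additions, not part of the answers above.

```python
import lxml.html

HTML = """<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>"""

doc = lxml.html.fromstring(HTML)


def section_text(doc, header):
    """Return the stripped text nodes whose nearest preceding h5 sibling is `header`."""
    # preceding-sibling is a reverse axis, so h5[1] is the closest h5 before the text node.
    query = "//text()[preceding-sibling::h5[1][normalize-space()=$header]]"
    nodes = doc.xpath(query, header=header)
    # Drop whitespace-only nodes such as the newline after the last <br>.
    return [n.strip() for n in nodes if n.strip()]


print(section_text(doc, "SecondHeader"))
print(section_text(doc, "ThirdHeader"))
print(section_text(doc, "NoSuchHeader"))
```

Only one section title appears in the query, and a title that is absent from the page simply yields an empty list.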

Related

Unable to find element by attribute with lxml

I'm using a European Space Agency API to query (result can be viewed here) for satellite image metadata to parse into python objects.
Using the requests library I can successfully get the result in XML format and then read the content with lxml. I am able to find the elements and explore the tree as expected:
# parse the response; fromstring() returns the root element directly
root = etree.fromstring(response.content)
ns = root.nsmap
# get the first entry element and its summary
e = root.find('entry',ns)
summary = e.find('summary',ns).text
print summary
>> 'Date: 2018-11-28T09:10:56.879Z, Instrument: OLCI, Mode: , Satellite: Sentinel-3, Size: 713.99 MB'
The entry element has several date descendants with different values of the attribute name:
for d in e.findall('date',ns):
    print d.tag, d.attrib
>> {http://www.w3.org/2005/Atom}date {'name': 'creationdate'}
{http://www.w3.org/2005/Atom}date {'name': 'beginposition'}
{http://www.w3.org/2005/Atom}date {'name': 'endposition'}
{http://www.w3.org/2005/Atom}date {'name': 'ingestiondate'}
I want to grab the beginposition date element using XPath syntax [@attrib='value'] but it just returns None. Even just searching for a date element with the name attribute ([@attrib]) returns None:
dt_begin = e.find('date[@name="beginposition"]',ns) # dt_begin is None
dt_begin = e.find('date[@name]',ns) # dt_begin is None
The entry element includes other children that exhibit the same behaviour e.g. multiple str elements also with differing name attributes.
Has anyone encountered anything similar or is there something I'm missing? I'm using Python 2.7.14 with lxml 4.2.4
It looks like an explicit prefix is needed when a predicate ([@name="beginposition"]) is used. Here is a test program:
from lxml import etree
print etree.LXML_VERSION
tree = etree.parse("data.xml")
ns1 = tree.getroot().nsmap
print ns1
print tree.find('entry', ns1)
print tree.find('entry/date', ns1)
print tree.find('entry/date[@name="beginposition"]', ns1)
ns2 = {"atom": 'http://www.w3.org/2005/Atom'}
print tree.find('atom:entry', ns2)
print tree.find('atom:entry/atom:date', ns2)
print tree.find('atom:entry/atom:date[@name="beginposition"]', ns2)
Output:
(4, 2, 5, 0)
{None: 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750b90>
<Element {http://www.w3.org/2005/Atom}date at 0x7f89877503f8>
None
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750098>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a950>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a7a0>
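For comparison, here is a small self-contained sketch of the explicit-prefix form. It uses the standard library's xml.etree.ElementTree rather than lxml (the two share the find()/findall() interface), and the sample feed is made up for illustration; it is not the actual API response.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the Atom namespace; not the real ESA response.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <summary>sample entry</summary>
    <date name="creationdate">2018-11-28T12:00:00Z</date>
    <date name="beginposition">2018-11-28T09:10:56.879Z</date>
    <date name="endposition">2018-11-28T09:13:56.879Z</date>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

# Bind the namespace URI to an explicit prefix instead of relying on the default one.
ns = {"atom": "http://www.w3.org/2005/Atom"}

# The attribute predicate is attached to a prefixed element name.
begin = root.find('atom:entry/atom:date[@name="beginposition"]', ns)
names = [d.get("name") for d in root.findall("atom:entry/atom:date", ns)]

print(begin.text)
print(names)
```

Spelling the prefix out on every step keeps the query independent of how the parser exposes the default namespace.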

Scrapy: How to get a correct selector

I would like to select the following text:
Bold normal Italist
I need to select and get: Bold normal Italist.
The html is:
<a><strong>Bold</strong> normal <i>Italist</i></a>
However, a/text() yields
normal
only. Does anyone know a fix? I'm testing crawling Bing, and the bold text is in a different position depending on the query.
You can use a//text() instead of a/text() to get all text items.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
doc = """
<a><strong>Bold</strong> normal <i>Italist</i></a>
"""
sel = Selector(text=doc, type="html")
result = sel.xpath('//a/text()').extract()
print result
# >>> [u' normal ']
result = u''.join(sel.xpath('//a//text()').extract())
print result
# >>> Bold normal Italist
You can also try
string(a)
or
normalize-space(a)
which return Bold normal Italist. (a/string() is XPath 2.0 syntax; Scrapy's selectors support XPath 1.0 only.)
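The differences between these selections can be checked outside Scrapy with lxml, which Scrapy's selectors are built on. This is only a sketch: the attribute-less a wrapper and the surrounding div are assumptions made for the demo.

```python
import lxml.html

# The bare <a> wrapper and the outer <div> are assumptions for this demo.
doc = lxml.html.fromstring(
    "<div><a><strong>Bold</strong> normal <i>Italist</i></a></div>"
)

direct = doc.xpath("//a/text()")              # text nodes that are direct children of <a>
joined = "".join(doc.xpath("//a//text()"))    # every descendant text node, concatenated
as_string = doc.xpath("string(//a)")          # XPath 1.0 string value of the first <a>
squeezed = doc.xpath("normalize-space(//a)")  # same, with runs of whitespace collapsed

print(direct)
print(joined)
print(as_string)
print(squeezed)
```

string() and normalize-space() return a single string for the first matching node, whereas //a//text() returns a list that still has to be joined.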

Trying to get table rows with Scrapy xpath

I have some html that looks like the screenshot. I want to get the table rows. I have:
for table_row in response.selector.xpath("//*[@id = 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties']"):
    print table_row
In the command line I tried:
>>> table_row
Out[5]: <Selector xpath="//*[@id = 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties']" data=u'<table class="ParamText" cellspacing="0"'>
>>> table_row.xpath('/tbody')
Out[6]: []
>>> table_row.xpath('//tbody')
Out[7]: []
Why am I unable to select the tbody?
tbody is generated by the browser; it is not in the HTML that the Scrapy downloader receives. Go straight to the tr elements:
table_row.xpath('.//tr')
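Two things are worth separating here. First, the raw markup has no tbody. Second, an expression starting with / or // is evaluated from the document root even when it is called on an already selected node, so relative queries need a leading dot. A rough lxml sketch, with a made-up table and a shortened id:

```python
import lxml.html

# Made-up table; the id is shortened for the demo and there is no <tbody> in the markup.
HTML = """<div><table id="gvParties" class="ParamText">
<tr><th>Name</th><th>Role</th></tr>
<tr><td>Alice</td><td>Plaintiff</td></tr>
<tr><td>Bob</td><td>Defendant</td></tr>
</table></div>"""

doc = lxml.html.fromstring(HTML)
table = doc.xpath("//*[@id = 'gvParties']")[0]

tbody = table.xpath("./tbody")  # empty: the parser does not invent a <tbody>
rows = table.xpath(".//tr")     # leading dot: search below this table only

# Collect the text of each header or data cell, row by row.
cells = [[c.text_content() for c in row.xpath("./th | ./td")] for row in rows]

print(tbody)
print(len(rows))
print(cells)
```

If a page does ship a literal tbody, .//tr still matches, which makes it the safer choice either way.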

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 by TAGS="tag1,tag2" and move it on the first line within <A>.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
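import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
# start with empty values so a leading non-<DT>/<DD> line does not raise NameError
current_dt = ""
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            # words starting with '#' are tags
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            # insert TAGS="..." before the closing '>' of the <A ...> tag
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            # any other line ends the current entry: print it and reset
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
# print the last entry
print(current_dt)
print(current_dd)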
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN { FS = "#" }
/^<DT>/ {
    if (d == 1) print "<DT>" s    # for printing lines with no tags
    s = substr($0, 5); tags = ""  # copy the line after "<DT>" (it is printed once the <DD> line is seen)
    d = 1
}
/^<DD>/ {
    d = 0
    m = match(s, />/)             # find the end of the HREF descriptor (first match of ">")
    for (i = 2; i <= NF; i++) { sub(/ $/, "", $i); tags = tags "," $i }  # concatenate tags
    td = match(tags, / /)         # parse for tag description (marked by a preceding space)
    if (td == 0) {                # no description exists
        tags = substr(tags, 2)
        tagdes = ""
    }
    else {                        # description exists
        tagdes = substr(tags, td)
        tags = substr(tags, 2, td - 2)
    }
    print "<DT>" substr(s, 1, m - 1) ", TAGS=\"" tags "\"" substr(s, m)
    print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
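The core step in both answers, separating the leading #tags from the description, can be isolated as a small function. This is only a sketch; split_tags and tags_attribute are hypothetical helper names, and the description is re-joined with single spaces.

```python
def split_tags(dd_text):
    """Split leading '#tag' words from the description that follows them."""
    words = dd_text.split()
    tags = []
    # Consume words from the front for as long as they look like tags.
    while words and words[0].startswith("#"):
        tags.append(words.pop(0)[1:])  # drop the leading '#'
    return tags, " ".join(words)


def tags_attribute(tags):
    """Format tags the way the question asks for, e.g. TAGS="tag1,tag2"."""
    return 'TAGS="%s"' % ",".join(tags)


tags, description = split_tags(
    "#iceland #tour #car #drive #self Self-driving tour of Iceland"
)
print(tags_attribute(tags))
print(description)
```

Only the leading run of # words is treated as tags, so a # that appears later inside the description is left untouched.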

Capybara, rspec- How to find text anywhere on page

There are multiple ways to do this, but I want to do it in a specific manner.
To get an element with some text in it, my framework builds an XPath like this:
@xpath = "//h1[contains(text(), '[the-text-i-am-searching-for]')]"
Then it executes:
find(:xpath, @xpath).visible?
Now, in the same format, I want to build an XPath that looks for a text anywhere on the page and can then be used in find(:xpath, @xpath).visible? to return true or false.
To give a little more context:
My HTML paragraph looks something like this-
<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>
but if I try to find it using find(:xpath, @xpath) where my xpath is
@xpath = "//p[contains(text(), '[the-text-i-am-searching-for]')]"
it fails.
Try replacing "//p[contains(text(), '[the-text-i-am-searching-for]')]" with "//p[contains(., '[the-text-i-am-searching-for]')]"
I don't know your environment but in Python with lxml it works:
>>> import lxml.etree
>>> doc = lxml.etree.HTML("""<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>""")
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
>>> doc.xpath('//p[contains(., "[the-text-i-am-searching-for]")]')
[<Element p at 0x1c1b9b0>]
>>>
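The context node . is converted to its string value (the concatenated text of all its descendants) to match the signature boolean contains(string, string) (http://www.w3.org/TR/xpath/#section-String-Functions). With text(), the node-set is converted using only its first node, 'some text here ', which is why the original query finds nothing.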
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>>
Consider these variations
>>> doc.xpath('//p')
[<Element p at 0x1c1b9b0>]
>>> doc.xpath('//p/*')
[<Element b at 0x1e34b90>, <Element a at 0x1e34af0>]
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>> doc.xpath('//p/text()')
['some text here ', ' again some text ', ' [the-text-i-am-searching-for]']
>>> doc.xpath('string(//p/text())')
'some text here '
>>> doc.xpath('//p/text()[3]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p/text()[contains(., "[the-text-i-am-searching-for]")]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
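For the "anywhere on the page" part of the question, the same idea works with * in place of a fixed tag name. The sketch below (lxml again, with the query variable name needle chosen for illustration) contrasts a broad form, which also matches every ancestor of the text, with a narrower one that keeps only elements having a text-node child that contains the text.

```python
import lxml.etree

doc = lxml.etree.HTML(
    "<blink><p>some text here <b><u>some bold and underlined text here</u></b>"
    " again some text Learn more [the-text-i-am-searching-for]</p></blink>"
)
needle = "[the-text-i-am-searching-for]"

# Broad: every element whose string value contains the needle, ancestors included.
broad = [el.tag for el in doc.xpath("//*[contains(., $needle)]", needle=needle)]

# Narrow: only elements with a text-node child that contains the needle.
narrow = [el.tag for el in doc.xpath("//*[text()[contains(., $needle)]]", needle=needle)]

print(broad)
print(narrow)
```

The narrow form returns only the innermost element, which is usually what a single-element lookup wants.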
