Unable to find element by attribute with lxml - xpath

I'm using a European Space Agency API to query for satellite image metadata, which I then parse into Python objects.
Using the requests library I can successfully get the result in XML format and then read the content with lxml. I am able to find the elements and explore the tree as expected:
# load the response content; etree.fromstring returns the root element
root = etree.fromstring(response.content)
ns = root.nsmap
# get the first entry element and its summary
e = root.find('entry', ns)
summary = e.find('summary', ns).text
print summary
>> 'Date: 2018-11-28T09:10:56.879Z, Instrument: OLCI, Mode: , Satellite: Sentinel-3, Size: 713.99 MB'
The entry element has several date descendants with different values of the name attribute:
for d in e.findall('date', ns):
    print d.tag, d.attrib
>> {http://www.w3.org/2005/Atom}date {'name': 'creationdate'}
{http://www.w3.org/2005/Atom}date {'name': 'beginposition'}
{http://www.w3.org/2005/Atom}date {'name': 'endposition'}
{http://www.w3.org/2005/Atom}date {'name': 'ingestiondate'}
I want to grab the beginposition date element using the XPath syntax [@attrib='value'], but it just returns None. Even just searching for a date element with a name attribute ([@attrib]) returns None:
dt_begin = e.find('date[@name="beginposition"]', ns)  # dt_begin is None
dt_begin = e.find('date[@name]', ns)  # dt_begin is None
The entry element includes other children that exhibit the same behaviour e.g. multiple str elements also with differing name attributes.
Has anyone encountered anything similar, or is there something I'm missing? I'm using Python 2.7.14 with lxml 4.2.4.

It looks like an explicit prefix is needed when a predicate ([@name="beginposition"]) is used. Here is a test program:
from lxml import etree
print etree.LXML_VERSION
tree = etree.parse("data.xml")
ns1 = tree.getroot().nsmap
print ns1
print tree.find('entry', ns1)
print tree.find('entry/date', ns1)
print tree.find('entry/date[@name="beginposition"]', ns1)
ns2 = {"atom": 'http://www.w3.org/2005/Atom'}
print tree.find('atom:entry', ns2)
print tree.find('atom:entry/atom:date', ns2)
print tree.find('atom:entry/atom:date[@name="beginposition"]', ns2)
Output:
(4, 2, 5, 0)
{None: 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750b90>
<Element {http://www.w3.org/2005/Atom}date at 0x7f89877503f8>
None
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750098>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a950>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a7a0>
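Applied to the snippet from the question, the fix is simply to map an explicit prefix to the Atom namespace instead of relying on the None entry in root.nsmap. Here is a minimal sketch; the prefix name atom is arbitrary, and response is assumed to hold the API result as in the question:
from lxml import etree

root = etree.fromstring(response.content)
ns = {'atom': 'http://www.w3.org/2005/Atom'}  # explicit prefix instead of the None key
e = root.find('atom:entry', ns)
dt_begin = e.find('atom:date[@name="beginposition"]', ns)
print dt_begin.attrib  # {'name': 'beginposition'}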

Related

YAML mapping order not preserved when using alias and yamlordereddictloader loader

I want to load a YAML file into Python as an OrderedDict. I am using yamlordereddictloader to preserve ordering.
However, I notice that the aliased object is placed "too soon" in the OrderedDict in the output.
How can I preserve the order of this mapping when read into Python, ideally as an OrderedDict? Is it possible to achieve this result without writing some custom parsing?
Notes:
I'm not particularly concerned with the method used, as long as the end result is the same.
Using sequences instead of mappings is problematic because they can result in nested output, and I can't simply flatten everything (some nestedness is appropriate).
When I try to just use !!omap, I cannot seem to merge the aliased mapping (d1.dt) into the d2 mapping.
I'm on Python 3.6; if I don't use this loader or !!omap, order is not preserved (apparently contrary to the top 'Update' here: https://stackoverflow.com/a/21912744/2343633).
import yaml
import yamlordereddictloader
yaml_file = """
d1:
id:
nm1: val1
dt: &dt
nm2: val2
nm3: val3
d2: # expect nm4, nm2, nm3
nm4: val4
<<: *dt
"""
out = yaml.load(yaml_file, Loader=yamlordereddictloader.Loader)
keys = [x for x in out['d2']]
print(keys) # ['nm2', 'nm3', 'nm4']
assert keys==['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
Is it possible to achieve this result without writing some custom parsing?
Yes. You need to override the method flatten_mapping from SafeConstructor. Here's a basic working example:
import yaml
import yamlordereddictloader
from yaml.constructor import *
from yaml.reader import *
from yaml.parser import *
from yaml.resolver import *
from yaml.composer import *
from yaml.scanner import *
from yaml.nodes import *
class MyLoader(yamlordereddictloader.Loader):
    def __init__(self, stream):
        yamlordereddictloader.Loader.__init__(self, stream)

    # taken from here and reengineered to keep order:
    # https://github.com/yaml/pyyaml/blob/5.3.1/lib/yaml/constructor.py#L207
    def flatten_mapping(self, node):
        merged = []

        def merge_from(node):
            if not isinstance(node, MappingNode):
                raise ConstructorError("while constructing a mapping",
                        node.start_mark, "expected mapping for merging, but found %s" %
                        node.id, node.start_mark)
            self.flatten_mapping(node)
            merged.extend(node.value)

        for index in range(len(node.value)):
            key_node, value_node = node.value[index]
            if key_node.tag == u'tag:yaml.org,2002:merge':
                if isinstance(value_node, SequenceNode):
                    for subnode in value_node.value:
                        merge_from(subnode)
                else:
                    merge_from(value_node)
            else:
                if key_node.tag == u'tag:yaml.org,2002:value':
                    key_node.tag = u'tag:yaml.org,2002:str'
                merged.append((key_node, value_node))
        node.value = merged
yaml_file = """
d1:
id:
nm1: val1
dt: &dt
nm2: val2
nm3: val3
d2: # expect nm4, nm2, nm3
nm4: val4
<<: *dt
"""
out = yaml.load(yaml_file, Loader=MyLoader)
keys = [x for x in out['d2']]
print(keys)
assert keys==['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
This doesn't have the best performance, as it basically copies every key-value pair from every mapping once during loading, but it works. Performance enhancement is left as an exercise for the reader :).
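As a quick check of the SequenceNode branch (merging a list of aliases), here is a small follow-up using the same MyLoader; the document and the expected key order are my own example, not from the question:
yaml_multi = """
a: &a
  k1: v1
b: &b
  k2: v2
c:
  k3: v3
  <<: [*a, *b]
"""
out = yaml.load(yaml_multi, Loader=MyLoader)
print(list(out['c']))  # expected: ['k3', 'k1', 'k2'], explicit key first, then merged keys in order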

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
  <aldo_nova album="aldo nova">
    <release_date value="19820401"/>
  </aldo_nova>
  <classix_nouveaux album="Night People"/>
    <release_date value="19820501"/>
  </classix_nouveaux>
  <engligh_beat album="I Just Can't Stop It"/>
    <release_date value="19800501"/>
  </engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
  <release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
  <release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?
Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are self-terminating tags, which close those elements immediately. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even sub-string matches, but neither CSS nor XPath understands "ranges" of dates, nor do they have any idea what a Date is, so you'd have to extract all the nodes, convert the date value into a Date object (or, in this case, an integer), then compare it to the range:
date_range = 19800501..19820401
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
  <albums>
    <album band="aldo nova" title="aldo nova" release_date="19820401"/>
    <album band="classix nouveaux" title="Night People" release_date="19820501"/>
    <album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
  </albums>
</collection>
Once the XML is in a standard form, it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
  <albums>
    <album band="aldo nova" title="aldo nova" release_date="19820401"/>
    <album band="classix nouveaux" title="Night People" release_date="19820501"/>
    <album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
  </albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[#band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800501..19820401
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

None type returned / Tag not found

I have to parse a file to find the minimum and multiple (step) quantity of each SKU:
<product sku="13603">
<sku>13603</sku>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
<product sku="13713">
<sku>13713</sku>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
...
My program is very simple:
from lxml import etree

tree = etree.parse('./file-above.xml')
for elem in tree.iterfind('product'):
    vSKU = elem.find('sku').text
    vMin = elem.find('quantity/min_order_quantity').text
When I run it, it raises an error:
AttributeError: 'NoneType' object has no attribute 'text'
When run interactively and changing the last line to...
print elem.find('sku').text
it works, but the line...
print elem.find('quantity/min_order_quantity').text
fails. What's wrong?
You have a typo in your XPath: you need vMin = elem.find('quantity/min-order-quantity').text instead of vMin = elem.find('quantity/min_order_quantity').text (i.e. a hyphen instead of an underscore).
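With that single character changed, the loop from the question should run. Here is a minimal corrected sketch; it assumes the real file has a single root element (as it must for the code in the question to reach the failing line), and vMult is just an illustrative name for the step quantity:
from lxml import etree

tree = etree.parse('./file-above.xml')
for elem in tree.iterfind('product'):
    vSKU = elem.find('sku').text
    vMin = elem.find('quantity/min-order-quantity').text   # hyphens, matching the element names in the XML
    vMult = elem.find('quantity/step-quantity').text        # the "mult" (step) quantity follows the same pattern
    print vSKU, vMin, vMult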

What is the better logic for identifying the latest versioned files from a big list

I have 200 images in a folder, and each file may have several versions (example: car_image#1, car_image#2, bike_image#2, etc.). My requirement is to build a utility to copy all the latest files from this directory to another.
My approach is:
Put the imagesNames (without containing version numbers) into a list
Eliminate the duplicates from the list
Iterate through the list and identify the latest version of each unique file (I am a little unclear on this step)
Can someone suggest a better idea/algorithm to achieve this?
My approach would be:
Make a list of unique names by getting each filename up to the #, only adding unique values.
Make a dictionary with filenames as keys, and set values to be the version number, updating when it's larger than the one stored.
Go through the dictionary and produce the filenames to grab.
My go-to would be a python script but you should be able to do this in pretty much whatever language you find suitable.
Ex code for getting the filename list:
# get the list of base filenames (names without version numbers)
myList = []
for x in file_directory:
    fname = x.split("#")[0]
    if fname not in myList:
        myList = myList + [fname]

# track the highest version number seen for each base name
myDict = {}
for x in myList:
    if x not in myDict:
        myDict[x] = 0
for x in file_directory:
    fname = x.split("#")[0]
    fversion = x.split("#")[-1]
    if myDict[fname] < int(fversion):
        myDict[fname] = int(fversion)

# rebuild the filenames of the most recent versions
flist = []
for x in myDict:
    fname = str(x) + "#" + str(myDict[x])
    flist.append(fname)
Then flist would be a list of the filenames of the most recent versions.
I didn't run this or anything, but hopefully it helps!
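For instance, the same idea condensed into a function and fed a few sample names might look roughly like this; the sample list below is my own, loosely based on the question:
def latest_versions(file_directory):
    # highest version number seen so far for each base name (the text before the '#')
    best = {}
    for x in file_directory:
        fname, _, fversion = x.partition("#")
        if best.get(fname, 0) < int(fversion):
            best[fname] = int(fversion)
    # rebuild "name#version" strings for the latest versions only
    return ["%s#%d" % (fname, version) for fname, version in best.items()]

print(latest_versions(["car_image#1", "car_image#2", "bike_image#2"]))
# e.g. ['car_image#2', 'bike_image#2'] (dict ordering may vary on older Python versions)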
In Python 3
>>> import random
>>> images = sorted(set(sum([['%s_image#%i' % (nm, random.randint(1,9)) for i in range(random.randint(2,5))] for nm in 'car bike cat dog man tree'.split()], [])))
>>> print('\n'.join(images))
bike_image#2
bike_image#3
bike_image#4
bike_image#5
car_image#2
car_image#7
cat_image#3
dog_image#2
dog_image#5
dog_image#9
man_image#1
man_image#2
man_image#4
man_image#6
man_image#7
tree_image#3
tree_image#4
>>> from collections import defaultdict
>>> image2max = defaultdict(int)
>>> for image in images:
...     name, _, version = image.partition('#')
...     version = int(version)
...     if version > image2max[name]:
...         image2max[name] = version
...
>>> # Max version
>>> for image in sorted(image2max):
...     print('%s#%i' % (image, image2max[image]))
...
bike_image#5
car_image#7
cat_image#3
dog_image#9
man_image#7
tree_image#4
>>>

XPath - extracting text between two nodes

I'm encountering a problem with my XPath query. I have to parse a div which is divided into an unknown number of "sections". Each of these is separated by an h5 with the section name. The list of possible section titles is known, and each of them can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected result (for SecondHeader)
['text2a', 'text2b']
Query #1
//text()[following-sibling::h5/text()='ThirdHeader']
Result #1
['text1', 'text2a', 'text2b']
It's obviously a bit too much, so I've decided to restrict the result to the content between the selected header and the header before it.
Query #2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result #2
['text2a', 'text2b']
The yielded results meet the expectations. However, this can't be used: I don't know whether SecondHeader/ThirdHeader will exist in the parsed page or not. The query needs to use only one section title.
Query #3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result #3
[]
Could you please tell me what I am doing wrong? I've tested it in Google Chrome.
If all the h5 elements and text nodes are siblings, and you need to group by section, a possible option is simply to select text nodes by the count of h5 elements that precede them.
Example using lxml (in Python)
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>
You should be able to just test the first preceding sibling h5...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
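For example, reusing the doc parsed in the lxml example above, this expression should keep only the text nodes whose nearest preceding h5 sibling is SecondHeader (stripped here for readability):
>>> [t.strip() for t in doc.xpath("//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]")]
['text2a', 'text2b']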
