Switching from beautifulsoup to htmlelement - how to find elements - xpath

I have an exsiting process that extracts elements from html documents that make use of the xbrli xml standard.
And example of a document can be found here:
The process works well (I'm using multiprocessing to work in parallel) but I have ~20m html and xml files to process and I'm finding beautifulsoup is the core bottleneck.
I am looking at htmlelement as a hopefully quicker alternative to extracting the data I need but I'm struggling to find elements. For example, in BS I can do the following:
for tag in soup.find_all('xbrli:unit'):
l_unitid = tag.attrs.get('id')
l_value = tag.text
l_unit_dict[l_unitid] = {'unitid':l_unitid,'value':l_value}
Which will find all xbrli:unit tags and I can extract their values easily.
However, when I try something similar in htmlelement I get the following exception:
import htmlement
source = htmlement.parse("Prod223_2542_00010416_20190331.html")
for tag in source.iterfind('.//xbrli:unit'):
l_unitid = tag.attrs.get('id')
l_value = tag.text
print(l_unitid)
print(l_value)
SyntaxError: prefix 'xbrli' not found in prefix map
A bit of googling led me to a few articles, but I can't seem to make progress
SyntaxError: prefix 'a' not found in prefix map
Parsing XML with namespace in Python via 'ElementTree'
I've tried adding in a namespace map but it's just not finding anything, no matter which way round I put things, or what tags I look for
source = htmlement.parse("Prod223_2542_00010416_20190331.html")
namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//xbrli:period',namespaces):
l_unitid = tag.attrs.get('id')
l_value = tag.text
namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//{xbrli}period',namespaces):
l_unitid = tag.attrs.get('id')
l_value = tag.text
print(l_unitid)
print(l_value)
namespaces = {'period':'xbrli'}
for tag in source.iterfind('.//{xbrli}period',namespaces):
l_unitid = tag.attrs.get('id')
l_value = tag.text
print(l_unitid)
print(l_value)
namespaces = {'period':'xbrli'}
for tag in source.iterfind('.//period',namespaces):
l_unitid = tag.attrs.get('id')
l_value = tag.text
print(l_unitid)
print(l_value)
All return nothing - they don't enter the loop. I've clearly got something very wrong in my understanding of how to use the elementree structure vs BS, but I don't quite know how to move from one to the other.
Any suggestions would be welcome.

Two general comments before I get to a proposed answer:
First, you are dealing with an xml document, so it's generally better to use an xml, not html, parser. So that's what I'm using below instead of beautifull soup or htmlelement.
Second, about xbrl generally: from bitter experience (and as many others pointed out), xbrl is terrible. It's shiny on the surface, but once you pop the hood, it's a mess. So I don't envy you...
And, with that said, I tried to approximate what you are likely looking for. I didn't bother to create dictionaries or lists, and just used print() statement. Obviously, if it helps you, you can modify it to your own requirements:
from lxml import etree
import requests
r = requests.get('https://beta.companieshouse.gov.uk/company/00010416/filing-history/MzI1MTU3MzQzMmFkaXF6a2N4/document?format=xhtml&download=1')
root = etree.fromstring(r.content)
units = root.xpath(".//*[local-name()='unit'][#id]/#id")
for unit in units:
unit_id = unit
print('unit: ', unit)
print('----------------------------')
context = root.xpath(".//*[local-name()='context']")
for tag in context:
id = tag.xpath('./#id')
print('ID: ',id)
info = tag.xpath('./*[local-name()="entity"]')
identifier = info[0].xpath('.//*[local-name()="identifier"]')[0].text
print('identifier: ',identifier)
member = info[0].xpath('.//*[local-name()="explicitMember"]')
if len(member)>0:
dimension = member[0].attrib['dimension']
explicitMember = member[0].text
print('dimension: ',dimension,' explicit member: ',explicitMember)
periods = tag.xpath('.//*[local-name()="period"]')
for period in periods:
for child in period.getchildren():
if 'instant' in child.tag:
instant = child.text
print('instant: ',instant)
else:
dates = period.xpath('.//*')
start_date = dates[0].text
end_date = dates[1].text
print('start date: ', start_date,' end date: ',end_date)
print('===================')
A random sample from the output:
ID: ['cfwd_31_03_2018']
identifier: 00010416
instant: 2018-03-31
start date: 2017-04-01 end date: 2018-03-31
===================
ID: ['CountriesHypercube_FY_31_03_2019_Set1']
identifier: 00010416
dimension: ns15:CountriesRegionsDimension explicit member: ns15:EnglandWales
instant: 2018-03-31
start date: 2018-04-01 end date: 2019-03-31

Related

restructuredtext include partial rst directives from other files (i.e. templates)

I'm looking for a way to include partial rst directives from another file.
I have some restructured text files that have repeating table definitions, and I would like to make this a reusable template to avoid redefining the same table again and again.
multiple files list tables that all start like to following (i.e. common settings and header) but the actual data rows afterwards differ.
.. list-table::
:align: left
:header-rows: 1
:width: 100%
:widths: 25 25 25 25
* - ColumnHeader1
- ColumnHeader2
- ColumnHeader3
- ColumnHeader4
The include directive does not seem to work for this, as it will parse the line-table and then include the result, I cannot add additional rows to that line-table afterwards.
Is there any way that I could achieve this?
EDIT:
Forgot to mention that I use restructuredtext as part of sphinx. It seems it is quite easy to extend sphinx: https://www.sphinx-doc.org/en/master/development/tutorials/helloworld.html
EDIT2:
Moved my solution to an answer to this question
I found a way to implement this as a sphinx extension (following this tutorial https://www.sphinx-doc.org/en/master/development/tutorials/helloworld.html)
from docutils.parsers.rst.directives.tables import ListTable
from docutils.statemachine import ViewList
class MyTable(ListTable):
def run(self):
header_rows = ViewList([
'* - ColumnHeader1',
' - ColumnHeader2',
' - ColumnHeader3',
' - ColumnHeader4',
])
self.content = header_rows + self.content
self.options['header-rows'] = 1
self.options['width'] = '100%'
self.options['widths'] = '25 25 25 25'
self.options['align'] = 'left'
return super().run()
def setup(app):
app.add_directive('my-table', MyTable)
return {
'version': '0.1',
'parallel_read_safe': True,
'parallel_write_safe': True,
}
This let's me define tables like this now:
.. my-table::
* - Column 1
- Column 2
- Column 3
- Column 4

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I and where can I adjust the flow_style whether a flow-style map would trigger line wrapping?
What I think the most accurate approach is when you encounter a flow-style mapping in the dumping process is to first try to emit it to a buffer and then get the length of the buffer and if that combined with the column that you are in, actually emit block-style.
Any attempt to guesstimate the length of the output without actually trying to write that part of a tree is going to be hard, if not impossible to do without doing the actual emit. Among other things the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ( [1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]. So a simple calculation of the length of the scalars that are the keys and values and separating comma, colon and spaces is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented changes all mappings/sequences that are too long
that are on one line.
import sys
import ruamel.yaml
yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
I have seen things you people wouldn't believe.,
Attack ships on fire off the shoulder of Orion.,
I watched C-beams glitter in the dark near the Tannhäuser Gate.,
]}
"""
class Blockify:
def __init__(self, width, only_first=False, verbose=0):
self._width = width
self._yaml = None
self._only_first = only_first
self._verbose = verbose
#property
def yaml(self):
if self._yaml is None:
self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
y.preserve_quotes = True
y.width = 2**16
return self._yaml
def __call__(self, d):
pass_nr = 0
changed = [True]
while changed[0]:
changed[0] = False
try:
s = self.yaml.dumps(d)
except AttributeError:
print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
sys.exit(1)
if self._verbose > 1:
print(s)
too_long = set()
max_ll = -1
for line_nr, line in enumerate(s.splitlines()):
if len(line) > self._width:
too_long.add(line_nr)
if len(line) > max_ll:
max_ll = len(line)
if self._verbose > 0:
print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
sys.stdout.flush()
new_d = self.yaml.load(s)
self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
d = new_d
pass_nr += 1
return d, s
#staticmethod
def change_to_block(d, too_long, changed, only_first):
if isinstance(d, dict):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
# don't change keys if any value is changed
for v in d.values():
Blockify.change_to_block(v, too_long, changed, only_first)
if only_first and changed[0]:
return
if changed[0]: # don't change keys if value has changed
return
for k in d:
Blockify.change_to_block(k, too_long, changed, only_first)
if only_first and changed[0]:
return
if isinstance(d, (list, tuple)):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
for elem in d:
Blockify.change_to_block(elem, too_long, changed, only_first)
if only_first and changed[0]:
return
blockify = Blockify(96, verbose=2) # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output) # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] never will be in block style
because the tuple subclass CommentedKeySeq does never get a line number attached.

xpath could not recognize predicate for a tag

I try to use scrapy xpath to scrape a page, but it seems it cannot capture the tag with predicates when I use a for loop,
# This package will contain the spiders of your Scrapy project
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json
class CunyfristsectionSpider(scrapy.Spider):
name = "cunyfirst-section-spider"
start_urls = ["file:///Users/haowang/Desktop/section.htm"]
def parse(self, response):
url = response.url
yield scrapy.Request(url, self.parse_page)
def parse_page(self, response):
n = -1
for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
print(response.xpath("//a[#name ='MTG_CLASSNAME$10']/text()"))
n += 1
class_num = section.xpath('text()').extract_first()
# print(class_num)
classname = "MTG_CLASSNAME$" + str(n)
date = "MTG_DAYTIME$" + str(n)
instr = "MTG_INSTR$" + str(n)
print(classname)
class_name = response.xpath("//a[#name = classname]/text()")
I am looking for a tags with name as "MTG_CLASSNAME$" + str(n), with n being 0,1,2..., and I am getting empty output from my xpath query. Not sure why...
PS.
I am basically trying to scrape course and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes
with filter applied: Kingsborough CC, fall 18, BIO
Thanks!
Well... I've visited the website you put in the question description, I used element inspection and searched for "MTG_CLASSNAME" and I got 0 matches...
So I will give you some tools:
In your settings.py set that:
LOG_FILE = "log.txt"
LOG_STDOUT=True
then print the response body ( response.body ) where you should ( in the top of parse_page function in this case ) and search it in log.txt
Check there if there is what you are looking for.
If there is, use this https://www.freeformatter.com/xpath-tester.html (
or similar ) to check your xpath statement.
In addition, change for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
by for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]").extract():, this will raise an error when you get the data that you are looking for.

Improve genbank feature addition

I am trying to add more than 70000 new features to a genbank file using biopython.
I have this code:
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"
for result in results:
start = 0
end = 0
result = result.split("\t")
start = int(result[0])
end = int(result[1])
for record in SeqIO.parse(original, "gb"):
record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
SeqIO.write(record, fo, "gb")
Results is just a list of lists containing the start and end of each one of the features I need to add to the original gbk file.
This solution is extremely costly for my computer and I do not know how to improve the performance. Any good idea?
You should parse the genbank file just once. Omitting what results contains (I do not know exactly, because there are some missing pieces of code in your example), I would guess something like this would improve performance, modifying your code:
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"
original_records = list(SeqIO.parse(fi, "gb"))
for result in results:
result = result.split("\t")
start = int(result[0])
end = int(result[1])
for record in original_records:
record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
SeqIO.write(record, fo, "gb")

YAML deserializer with position information?

Does anyone know of a YAML deserializer that can provide position information for the constructed objects?
I know how to deserialize a YAML file into a Java object. Simple instructions on http://yamlbeans.sourceforge.net/.
However, I want to do some algorithmic validation on the deserialized object and report error back to the user pointing to the position in the YAML that cause the error.
Example:
=========YAML file==========
name: Nathan Sweet
age: 28
address: 4011 16th Ave S
=======JAVA class======
public class Contact {
public String name;
public int age;
public String address;
}
Imagine if I want to first load the yaml into Contact class and then validate the address against some repository and error back if its invalid. Something like:
'Line 3 Column 9: The address does not match valid entry in the database'
The problem is, currently there is no way to get the position inside a deserialized object from YAML.
Anyone know a solution to this issue?
Most YAML parsers, if they keep any information about positions around they drop it while constructing the language native objects.
In ruamel.yaml ¹, I keep more information around because I want to be able to round-trip with minimal loss of original layout (e.g. keeping comments and key order in mappings).
I don't keep information on individual key-value pairs, but I do on the "upper-left" position of a mapping². Because of the kept order of the mapping items you can give some rather nice feedback. Given an input file:
- name: anthon
age: 53
adres: Rijn en Schiekade 105
- name: Nathan Sweet
age: 28
address: 4011 16th Ave S
And a program that you call with the input file as argument:
#! /usr/bin/env python2.7
# coding: utf-8
# http://stackoverflow.com/questions/30677517/yaml-deserializer-with-position-information?noredirect=1#comment49491314_30677517
import sys
import ruamel.yaml
up_arrow = '↑'
def key_error(key, value, line, col, error, e='E'):
print('E[{}:{}]: {}'.format(line, col, error))
print('{}{}: {}'.format(' '*col, key, value))
print('{}{}'.format(' '*(col), up_arrow))
print('---')
def value_error(key, value, line, col, error, e='E'):
val_col = col + len(key) + 2
print('{}[{}:{}]: {}'.format(e, line, val_col, error))
print('{}{}: {}'.format(' '*col, key, value))
print('{}{}'.format(' '*(val_col), up_arrow))
print('---')
def value_warning(key, value, line, col, error):
value_error(key, value, line, col, error, e='W')
class Contact(object):
def __init__(self, vals):
for offset, k in enumerate(vals):
self.check(k, vals[k], vals.lc.line+offset, vals.lc.col)
for k in ['name', 'address', 'age']:
if k not in vals:
print('K[{}:{}]: {}'.format(
vals.lc.line+offset, vals.lc.col, "missing key: "+k
))
print('---')
def check(self, key, value, line, col):
if key == 'name':
if value[0].lower() == value[0]:
value_error(key, value, line, col,
'value should start with uppercase')
elif key == 'age':
if value < 50:
value_warning(key, value, line, col,
'probably too young for knowing ALGOL 60')
elif key == 'address':
pass
else:
key_error(key, value, line, col,
"unexpected key")
data = ruamel.yaml.load(open(sys.argv[1]), Loader=ruamel.yaml.RoundTripLoader)
for x in data:
contact = Contact(x)
giving you E(rrors), W(arnings) and K(eys missing):
E[0:8]: value should start with uppercase
name: anthon
↑
---
E[2:2]: unexpected key
adres: Rijn en Schiekade 105
↑
---
K[2:2]: missing key: address
---
W[4:7]: probably too young for knowing ALGOL 60
age: 28
↑
---
Which you should be able to parser in a calling program in any language to give feedback. The check method of course need adjusting to your requirements. This is not as nice as being to do that in the language the rest of your application is in, but it might be better than nothing.
In my experience handling the above format is certainly simpler than extending an existing (open source) YAML parser.
¹ Disclaimer: I am the author of that package
² I want to use that kind of information at some point to preserve spurious newlines, inserted for readability
In python, you can readily write custom Dumper/Loader objects and use them to load (or dump) your yaml code. You can have these objects track the file/line info:
import yaml
from collections import OrderedDict
class YamlOrderedDict(OrderedDict):
"""
An OrderedDict that was loaded from a yaml file, and is annotated
with file/line info for reporting about errors in the source file
"""
def _annotate(self, node):
self._key_locs = {}
self._value_locs = {}
nodeiter = node.value.__iter__()
for key in self:
subnode = nodeiter.next()
self._key_locs[key] = subnode[0].start_mark.name + ':' + \
str(subnode[0].start_mark.line+1)
self._value_locs[key] = subnode[1].start_mark.name + ':' + \
str(subnode[1].start_mark.line+1)
def key_loc(self, key):
try:
return self._key_locs[key]
except AttributeError, KeyError:
return ''
def value_loc(self, key):
try:
return self._value_locs[key]
except AttributeError, KeyError:
return ''
# Use YamlOrderedDict objects for yaml maps instead of normal dict
yaml.add_representer(OrderedDict, lambda dumper, data:
dumper.represent_dict(data.iteritems()))
yaml.add_representer(YamlOrderedDict, lambda dumper, data:
dumper.represent_dict(data.iteritems()))
def _load_YamlOrderedDict(loader, node):
rv = YamlOrderedDict(loader.construct_pairs(node))
rv._annotate(node)
return rv
yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _load_YamlOrderedDict)
Now when you read a yaml file, any mapping objects will be read as a YamlOrderedDict, which allows looking up the file location of keys in the mapping object. You can also add an iterator method like:
def iter_with_lines(self):
for key, val in self.items():
yield (key, val, self.key_loc(key))
...and now you can write a loop like:
for key,value,location in obj.iter_with_lines():
# iterate through the key/value pairs in a YamlOrderedDict, with
# the source file location

Resources