restructuredtext include partial rst directives from other files (i.e. templates) - python-sphinx

I'm looking for a way to include partial rst directives from another file.
I have some reStructuredText files that have repeating table definitions, and I would like to turn these into a reusable template to avoid redefining the same table again and again.
Multiple files contain list tables that all start like the following (i.e. common settings and header), but the actual data rows afterwards differ.
.. list-table::
   :align: left
   :header-rows: 1
   :width: 100%
   :widths: 25 25 25 25

   * - ColumnHeader1
     - ColumnHeader2
     - ColumnHeader3
     - ColumnHeader4
The include directive does not seem to work for this: it parses the included list-table on its own and inserts the result, so I cannot add additional rows to that list-table afterwards.
Is there any way that I could achieve this?
EDIT:
Forgot to mention that I use reStructuredText as part of Sphinx. It seems it is quite easy to extend Sphinx: https://www.sphinx-doc.org/en/master/development/tutorials/helloworld.html
EDIT2:
Moved my solution to an answer to this question

I found a way to implement this as a Sphinx extension (following this tutorial: https://www.sphinx-doc.org/en/master/development/tutorials/helloworld.html):
from docutils.parsers.rst.directives.tables import ListTable
from docutils.statemachine import ViewList


class MyTable(ListTable):
    def run(self):
        # the shared header row, expressed as list-table source lines
        header_rows = ViewList([
            '* - ColumnHeader1',
            '  - ColumnHeader2',
            '  - ColumnHeader3',
            '  - ColumnHeader4',
        ])
        # prepend the header to whatever rows the directive body supplies
        self.content = header_rows + self.content
        self.options['header-rows'] = 1
        self.options['width'] = '100%'
        self.options['widths'] = '25 25 25 25'
        self.options['align'] = 'left'
        return super().run()


def setup(app):
    app.add_directive('my-table', MyTable)
    return {
        'version': '0.1',
        'parallel_read_safe': True,
        'parallel_write_safe': True,
    }
This now lets me define tables like this:
.. my-table::

   * - Column 1
     - Column 2
     - Column 3
     - Column 4
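To wire this into a build, here is a minimal sketch of the conf.py side, following the helloworld tutorial's convention of keeping local extensions in an _ext directory (the file name my_table.py is my assumption):

# conf.py
import os
import sys
sys.path.append(os.path.abspath('_ext'))  # assumes the code above lives in _ext/my_table.py

extensions = ['my_table']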

Related

Combine/Merge every x images together with title of each pict (linux)

I have some folders containing many jpg pictures (the number depends on the folder).
I would like, for instance, to combine every 4 pictures** together with the title of each image (see picture below).
(In case there are not exactly 4 images in the last batch, I should get however many pictures are left, such as 3, 2 or 1.)
**Ideally I could change that number to other values like 5, 6 or 10 (the number I choose would depend on the context), and I could also choose the number of columns (I showed 2 columns in my example below).
How can I do this with a Linux command or any free/open-source Linux software?
As I did not find what I wanted, I created my own Python code to solve this (it's probably not the most perfect script of the century, but it works):
"""
Prints a collage according to desired number of column and rows with title of file
Instruction
1. Put all jpg picture in same folder [tested sucessfully on 12mb per pict]
2. select desired columns in NO_COL
3. select desired rowsin in NO_ROW
4. run the script which will output the collage with <cur_date>_export.png files
"""
#import libraries
import time
import os
import imageio as iio
from matplotlib import pyplot as plt
def render_collage(pict_file_name_list):
""" create one collage """
fig = plt.figure(figsize=(40, 28)) #change if needed
cnt = 1
for cur_img_name in pict_file_name_list:
img_var = iio.imread(cur_img_name)
fig.add_subplot(NO_COL, NO_ROW, cnt)
plt.imshow(img_var)
plt.axis('off')
plt.title(cur_img_name, fontsize = 30) #change if needed
cnt = cnt + 1
cur_date = time.strftime("%Y-%m-%d--%H-%M-%s")
fig.savefig(cur_date+'_export.png')
NO_COL = 3
NO_ROW = 3
NBR_IMG_COLLAGE = NO_COL * NO_ROW
img_list_name = [elem for elem in os.listdir() if 'jpg' in elem] #keep only file having .jpg
while len(img_list_name) >= 1:
sub_list = img_list_name[:NBR_IMG_COLLAGE]
render_collage(sub_list)
del img_list_name[:NBR_IMG_COLLAGE]
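As a side note on the original "Linux command" part of the question: ImageMagick's montage tool can build a similar labelled grid in one line. A minimal sketch, assuming the pictures are .jpg files in the current folder (the tile and spacing values are just one possible choice):

# 2 columns ("2x"), as many rows as needed; each tile labelled with its file name
montage -label '%f' *.jpg -tile 2x -geometry +4+4 collage.png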

Switching from beautifulsoup to htmlelement - how to find elements

I have an existing process that extracts elements from HTML documents that use the XBRL (xbrli) XML standard.
An example of a document can be found here:
The process works well (I'm using multiprocessing to work in parallel), but I have ~20m HTML and XML files to process and I'm finding BeautifulSoup is the core bottleneck.
I am looking at htmlement as a hopefully quicker alternative for extracting the data I need, but I'm struggling to find elements. For example, in BS I can do the following:
for tag in soup.find_all('xbrli:unit'):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    l_unit_dict[l_unitid] = {'unitid': l_unitid, 'value': l_value}
This finds all xbrli:unit tags, and I can extract their values easily.
However, when I try something similar in htmlement, I get the following exception:
import htmlement

source = htmlement.parse("Prod223_2542_00010416_20190331.html")
for tag in source.iterfind('.//xbrli:unit'):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)

SyntaxError: prefix 'xbrli' not found in prefix map
A bit of googling led me to a few articles, but I can't seem to make progress:
SyntaxError: prefix 'a' not found in prefix map
Parsing XML with namespace in Python via 'ElementTree'
I've tried adding in a namespace map, but it's just not finding anything, no matter which way round I put things or what tags I look for:
source = htmlement.parse("Prod223_2542_00010416_20190331.html")

namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//xbrli:period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text

namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//{xbrli}period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)

namespaces = {'period': 'xbrli'}
for tag in source.iterfind('.//{xbrli}period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)

namespaces = {'period': 'xbrli'}
for tag in source.iterfind('.//period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)
All of these return nothing; they never enter the loop. I've clearly got something very wrong in my understanding of how to use the ElementTree structure vs BS, but I don't quite know how to move from one to the other.
Any suggestions would be welcome.
Two general comments before I get to a proposed answer:
First, you are dealing with an XML document, so it's generally better to use an XML, not HTML, parser. So that's what I'm using below instead of BeautifulSoup or htmlement.
Second, about XBRL generally: from bitter experience (and as many others have pointed out), XBRL is terrible. It's shiny on the surface, but once you pop the hood, it's a mess. So I don't envy you...
And, with that said, I tried to approximate what you are likely looking for. I didn't bother to create dictionaries or lists, and just used print() statements. Obviously, if it helps you, you can modify it to your own requirements:
from lxml import etree
import requests

r = requests.get('https://beta.companieshouse.gov.uk/company/00010416/filing-history/MzI1MTU3MzQzMmFkaXF6a2N4/document?format=xhtml&download=1')
root = etree.fromstring(r.content)

units = root.xpath(".//*[local-name()='unit'][@id]/@id")
for unit in units:
    unit_id = unit
    print('unit: ', unit)
print('----------------------------')

context = root.xpath(".//*[local-name()='context']")
for tag in context:
    id = tag.xpath('./@id')
    print('ID: ', id)
    info = tag.xpath('./*[local-name()="entity"]')
    identifier = info[0].xpath('.//*[local-name()="identifier"]')[0].text
    print('identifier: ', identifier)
    member = info[0].xpath('.//*[local-name()="explicitMember"]')
    if len(member) > 0:
        dimension = member[0].attrib['dimension']
        explicitMember = member[0].text
        print('dimension: ', dimension, ' explicit member: ', explicitMember)
    periods = tag.xpath('.//*[local-name()="period"]')
    for period in periods:
        for child in period.getchildren():
            if 'instant' in child.tag:
                instant = child.text
                print('instant: ', instant)
            else:
                dates = period.xpath('.//*')
                start_date = dates[0].text
                end_date = dates[1].text
                print('start date: ', start_date, ' end date: ', end_date)
    print('===================')
A random sample from the output:
ID: ['cfwd_31_03_2018']
identifier: 00010416
instant: 2018-03-31
start date: 2017-04-01 end date: 2018-03-31
===================
ID: ['CountriesHypercube_FY_31_03_2019_Set1']
identifier: 00010416
dimension: ns15:CountriesRegionsDimension explicit member: ns15:EnglandWales
instant: 2018-03-31
start date: 2018-04-01 end date: 2019-03-31
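A footnote on the original iterfind error: the prefix map must map a prefix to a namespace URI, not to a tag name. A minimal sketch using the standard-library ElementTree, assuming the file is well-formed XML and uses the usual XBRL instance namespace (check the xmlns:xbrli declaration in your own documents for the exact URI):

import xml.etree.ElementTree as ET

# prefix -> namespace URI, as declared in the document's root element
namespaces = {'xbrli': 'http://www.xbrl.org/2003/instance'}
root = ET.parse('Prod223_2542_00010416_20190331.html').getroot()
for tag in root.iterfind('.//xbrli:unit', namespaces):
    print(tag.get('id'), tag.text)  # Elements use .get(), not .attrs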

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF, but not for AdaBoost, where it fails with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on GitHub that states:
"TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the TreeEnsemble constructor similar to the one for gradient boosting."
But I really don't know how to implement it, since I'm quite new to this.
I had the same problem, and what I did was modify the file from the GitHub issue you are referencing.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can also double-click the path in the error message and the file will be opened.
The next step is to add another elif, as the answer in the GitHub issue says. In my case I did it from line 404, as follows:
1) Modify the source code.
...
            self.objective = objective_name_map.get(model.criterion, None)
            self.tree_output = "probability"
        elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"):  # from this line I have modified the code
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
            self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # done to get the decision criterion, for example gini
            self.tree_output = "probability"  # this is the last line I added
        elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"):  # TODO: add unit test for this case
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that, as with the other models, shap's code needs the 'criterion' attribute, which the AdaBoost classifier doesn't expose directly. In this case the attribute is obtained from the "weak" classifiers with which the AdaBoost model has been trained; that's why I added model.base_estimator_.criterion.
Finally, you have to import the library again, train your model and get the SHAP values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)

shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated but still does not support AdaBoostClassifier. Based on the previous answer, I've modified it to work with the shap/explainers/tree.py file at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and GitHub issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weight_boosting.AdaBoostClassifier"]):
    assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
    self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
    self.input_dtype = np.float32
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # done to get the decision criterion, for example gini
    self.tree_output = "probability"  # this is the last line added
Also working on testing to add this to the package :)
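If patching the installed package is not an option, a model-agnostic fallback is shap's KernelExplainer, which only needs a predict function and so works with AdaBoost unmodified, at the cost of much slower, sampling-based estimates. A minimal sketch (the background and subset sizes are arbitrary):

import shap

# summarize the background data to keep KernelExplainer tractable
background = shap.sample(X, 100)
explainer = shap.KernelExplainer(ADABoost_model.predict_proba, background)
shap_values = explainer.shap_values(X[:50])  # explain a subset; this is slow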

Ruby splitting a record into multiple records based on contents of a field

Record layout contains two fields:
Requisition
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet used in the LIMS (Lab Information Management System) is this:
subj.get_value('Tests').join(', ')
What I need to be able to do in the Ruby code snippet is create a new record off each comma-separated value in the second field.
NOTE:
The number of values in the 'Test Names' field varies from 1 to 20...or more.
There can be hundreds of Requisition records.
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"
If your data is reliably a string like the one you've shown in your example, here's a method:
data = subj.get_value('Tests').join(', ') # assuming this gives your string

def split_data(data)
  arr = data.gsub('"', '').split(',')
  # pair each test name with the requisition id (arr[0]); [1..-1] drops the id's own entry
  arr.map { |l| %(#{arr[0]},"#{l.strip}") }[1..-1]
end

puts split_data(data)
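If the 'Test Names' field can ever contain CSV quirks beyond this example, a more defensive sketch uses Ruby's standard CSV library to split the record first (the input line is taken from the question):

require 'csv'

line = 'R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA)"'
req, tests = CSV.parse_line(line)  # => ["R00000001", "4 Calprotectin, 1 Luminex xTAG, ..."]
tests.split(',').each { |t| puts %(#{req},"#{t.strip}") }
# R00000001,"4 Calprotectin"
# R00000001,"1 Luminex xTAG"
# R00000001,"8 H. pylori stool antigen (IgA)"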

How do I get 64 bit ids to work with Sphinx search server 0.9.9 on Mac OS?

I've been using Sphinx successfully for a while, but just ran into an issue that's got me confused... I back Sphinx with MySQL queries, and recently migrated my primary key strategy in a way that made the ids of the tables I'm indexing grow larger than 32 bits (in MySQL they're bigint unsigned). Sphinx was getting index hits, but returning nonsense ids (presumably the lower 32 bits of the ids returned by the queries, or something).
I looked into it and realized I hadn't passed the --enable-id64 flag to ./configure. No problem; I completely rebuilt Sphinx with that flag (I'm running 0.9.9, by the way). No change though! I'm still experiencing the exact same issue. My test scenario is pretty simple:
MySQL:
create table test_sphinx(id bigint unsigned primary key, text varchar(200));
insert into test_sphinx values (10102374447, 'Girls Love Justin Beiber');
insert into test_sphinx values (500, 'But Small Ids are working?');
Sphinx conf:
source new_proof
{
    type            = mysql
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = password
    sql_db          = testdb
    sql_port        =
    sql_query_pre   =
    sql_query_post  =
    sql_query       = SELECT id, text FROM test_sphinx
    sql_query_info  = SELECT * FROM `test_sphinx` WHERE `id` = $id
    sql_attr_bigint = id
}

index new_proof
{
    source         = new_proof
    path           = /usr/local/sphinx/var/data/new_proof
    docinfo        = extern
    morphology     = none
    stopwords      =
    min_word_len   = 1
    charset_type   = utf-8
    enable_star    = 1
    min_prefix_len = 0
    min_infix_len  = 2
}
Searching:
→ search -i new_proof beiber
Sphinx 0.9.9-release (r2117)
...
index 'new_proof': query 'beiber ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=1512439855, weight=1
(document not found in db)
words:
1. 'beiber': 1 documents, 1 hits
→ search -i new_proof small
Sphinx 0.9.9-release (r2117)
...
index 'new_proof': query 'small ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=500, weight=1
id=500
text=But Small Ids are working?
words:
1. 'small': 1 documents, 1 hits
Anyone have an idea about why this is broken?
Thanks in advance
-Phill
EDIT
Ah. Okay, got further. I didn't mention that I've been doing all of this testing on Mac OS. It looks like that may be my problem. I just compiled in 64-bit mode on Linux and it works great. There's also a clue, when I run the Sphinx command-line tools, that the compile didn't take:
My Mac (broken)
Sphinx 0.9.9-release (r2117)
Linux box (working)
Sphinx 0.9.9-id64-release (r2117)
So I guess the new question is: what's the trick to compiling for 64-bit keys on Mac OS?
Did you rebuild the index with the 64-bit indexer?
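For what it's worth, the full rebuild sequence would look something like this (index name taken from the config above; the make clean step matters, since stale 32-bit objects can otherwise get relinked):

./configure --enable-id64
make clean && make
sudo make install
# rebuild the index data with the new binary, rotating it into the running searchd
indexer new_proof --rotate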
