How do I properly format code for desired appending output? - python-2.x

I'm writing new code and having problem getting desired output. The code reads an html file and finds tags. it outputs the url only. I insert additional code to complete the link. I'm trying to insert the url two times within the string.
####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
links.append(''"{link}"'<br>')
time.sleep(.1)
with open("page-2.html", 'w') as html:
html.write('{links}\n'.format(links=links))

This should give you the desired html output file:
import re
from bs4 import BeautifulSoup
import html
with open("page1.html", 'r') as htmlb:
soup2 = BeautifulSoup(htmlb, 'lxml')
with open("page2.html", 'w') as h:
for link in soup2.find_all('a'):
h.write("{}<br>".format(link.get('href'),link.get('href')))

This gives me want I want I guess, but not exactly. I would rather see it written out "https://whatever.com/text/text/" than to see "whatever.com/text/text"
####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
links.append('{0}</a><br>'.format(link,link))
with open("page-2.html", 'w') as html:
html.write('{links}\n'.format(links=links))

Related

Parse quoted-printable encoding content from .mht file

I am trying to get all the images from .mht file by using Nokogiri gem. But since the .mht file has quoted-printable encoding, all the images that I received, has weird characters in it:
<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>
This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing
And below is the code that I am using to get all the images from the .mht file:
html = File.open("1646037951.mht").read
image_links = get_image_links(html)
def get_image_links(html)
html_doc = Nokogiri::HTML(html)
nodes = html_doc.xpath("//img[#src]")
raise "No <img .../> tags!" if nodes.empty?
nodes.inject([]) do |uris, node|
puts node.to_s
uris << node.attr('src').strip
end.uniq
end
I have tried to parse it by using .unpack('M').first but it's still not working as it just returns the same result as above.
Or maybe Rails have something for this?

How to include the source line number everywhere in html output in Sphinx?

Let's say I'm writing a custom editor for my RestructuredText/Sphinx stuff, with "live" html output preview. Output is built using Sphinx.
The source files are pure RestructuredText. No code there.
One desirable feature would be that right-clicking on some part of the preview opens the editor at the correct line of the source file.
To achieve that, one way would be to put that line number in every tag of the html file, for example using classes (e.g., class = "... lineno-124"). Or use html comments.
Note that I don't want to add more content to my source files, just that the line number be included everywhere in the output.
An approximate line number would be enough.
Someone knows how to do this in Sphinx, my way or another?
I decided to add <a> tags with a specific class "lineno lineno-nnn" where nnn is the line number in the RestructuredText source.
The directive .. linenocomment:: nnn is inserted before each new block of unindented text in the source, before the actual parsing (using a 'source-read' event hook).
linenocomment is a custom directive that pushes the <a> tag at build time.
Half a solution is still a solution...
import docutils.nodes as dn
from docutils.parsers.rst import Directive
class linenocomment(dn.General,dn.Element):
pass
def visit_linenocomment_html(self,node):
self.body.append(self.starttag(node,'a',CLASS="lineno lineno-{}".format(node['lineno'])))
def depart_linenocomment_html(self,node):
self.body.append('</a>')
class LineNoComment(Directive):
required_arguments = 1
optional_arguments = 0
has_content = False
add_index = False
def run(self):
node = linenocomment()
node['lineno'] = self.arguments[0]
return [node]
def insert_line_comments(app, docname, source):
print(source)
new_source = []
last_line_empty = True
lineno = 0
for line in source[0].split('\n'):
if line.strip() == '':
last_line_empty = True
new_source.append(line)
elif line[0].isspace():
new_source.append(line)
last_line_empty = False
elif not last_line_empty:
new_source.append(line)
else:
last_line_empty = False
new_source.append('.. linenocomment:: {}'.format(lineno))
new_source.append('')
new_source.append(line)
lineno += 1
source[0] = '\n'.join(new_source)
print(source)
def setup(app):
app.add_node(linenocomment,html=(visit_linenocomment_html,depart_linenocomment_html))
app.add_directive('linenocomment', LineNoComment)
app.connect('source-read',insert_line_comments)
return {
'version': 0.1
}

Declare additional dependency to sphinx-build in an extension

TL,DR: From a Sphinx extension, how do I tell sphinx-build to treat an additional file as a dependency? In my immediate use case, this is the extension's source code, but the question could equally apply to some auxiliary file used by the extension.
I'm generating documentation with Sphinx using a custom extension. I'm using sphinx-build to build the documentation. For example, I use this command to generate the HTML (this is the command in the makefile generated by sphinx-quickstart):
sphinx-build -b html -d _build/doctrees . _build/html
Since my custom extension is maintained together with the source of the documentation, I want sphinx-build to treat it as a dependency of the generated HTML (and LaTeX, etc.). So whenever I change my extension's source code, I want sphinx-build to regenerate the output.
How do I tell sphinx-build to treat an additional file as a dependency? That is not mentioned in the toctree, since it isn't part of the source. Logically, this should be something I do from my extension's setup function.
Sample extension (my_extension.py):
from docutils import nodes
from docutils.parsers.rst import Directive
class Foo(Directive):
def run(self):
node = nodes.paragraph(text='Hello world\n')
return [node]
def setup(app):
app.add_directive('foo', Foo)
Sample source (index.rst):
.. toctree::
:maxdepth: 2
.. foo::
Sample conf.py (basically the output of sphinx-quickstart plus my extension):
import sys
import os
sys.path.insert(0, os.path.abspath('.'))
extensions = ['my_extension']
templates_path = ['_templates']
source_suffix = '.rst'
master_doc = 'index'
project = 'Hello directive'
copyright = '2019, Gilles'
author = 'Gilles'
version = '1'
release = '1'
language = None
exclude_patterns = ['_build']
pygments_style = 'sphinx'
todo_include_todos = False
html_theme = 'alabaster'
html_static_path = ['_static']
htmlhelp_basename = 'Hellodirectivedoc'
latex_elements = {
}
latex_documents = [
(master_doc, 'Hellodirective.tex', 'Hello directive Documentation',
'Gilles', 'manual'),
]
man_pages = [
(master_doc, 'hellodirective', 'Hello directive Documentation',
[author], 1)
]
texinfo_documents = [
(master_doc, 'Hellodirective', 'Hello directive Documentation',
author, 'Hellodirective', 'One line description of project.',
'Miscellaneous'),
]
Validation of a solution:
Run make html (or sphinx-build as above).
Modify my_extension.py to replace Hello world by Hello again.
Run make html again.
The generated HTML (_build/html/index.html) must now contain Hello again instead of Hello world.
It looks like the note_dependency method in the build environment API should do what I want. But when should I call it? I tried various events but none seemed to hit the environment object in the right state. What did work was to call it from a directive.
import os
from docutils import nodes
from docutils.parsers.rst import Directive
import sphinx.application
class Foo(Directive):
def run(self):
self.state.document.settings.env.note_dependency(__file__)
node = nodes.paragraph(text='Hello done\n')
return [node]
def setup(app):
app.add_directive('foo', Foo)
If a document contains at least one foo directive, it'll get marked as stale when the extension that introduces this directive changes. This makes sense, although it could get tedious if an extension adds many directives or makes different changes. I don't know if there's a better way.
Inspired by Luc Van Oostenryck's autodoc-C.
As far as I know app.env.note_dependency can be called within the doctree-read to add any file as a dependency to the document currently being read.
So in your use case, I assume this would work:
from typing import Any, Dict
from sphinx.application import Sphinx
import docutils.nodes as nodes
def doctree-read(app: Sphinx, doctree: nodes.document):
app.env.note_dependency(file)
def setup(app: Sphinx):
app.connect("doctree-read", doctree-read)

Is it possible to reuse hyperlink defined in another file in restructuredtext (or sphinx)

Suppose I have two files a.rst and b.rst in the same folder, and a.rst looks like this
.. _foo: http://stackoverflow.com
`foo`_ is a website
It seems using foo in b.rst is not allowed. Is there a way to define hyperlinks and use them in multiple files?
Followup
I used the extlinks extension as Steve Piercy suggested. Its implementation and docstring can be seen here on github.
In my case, I define wikipedia link in my conf.py
extlinks = {'wiki': ('https://en.wikipedia.org/wiki/%s', '')}
and in the .rst files, use them like
:wiki:`Einstein <Albert_Einstein>`
where Einstein will be displayed as a link to https://en.wikipedia.org/wiki/Albert_Einstein
There are at least four possible solutions.
1. repeat yourself
Put your complete reST in each file. You probably don't want that.
2. combined rst_epilog and substitution
This one is clever. Configure the rst_epilog value, in your conf.py along with a substition with the replace directive:
rst_epilog = """
.. |foo| replace:: foo
.. _foo: http://stackoverflow.com
"""
and reST:
|foo|_ is a website
yields:
<a class="reference external" href="http://stackoverflow.com">foo</a>
3. extlinks
For links to external websites where you want to have a base URL and append path segments or arguments, you can use extlinks in your conf.py:
extensions = [
...
'sphinx.ext.extlinks',
...
]
...
extlinks = {'so': ('https://stackoverflow.com/%s', None)}
Then in your reST:
:so:`questions/49016433`
Yields:
<a class="reference external"
href="https://stackoverflow.com/questions/49016433">
https://stackoverflow.com/questions/49016433
</a>
4. intersphinx
For external websites that are documentation generated by Sphinx, then you can use intersphinx, in your conf.py:
extensions = [
...
'sphinx.ext.intersphinx',
...
]
...
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
}
Then in your reST:
:py:mod:`doctest`
Yields:
<a class="reference external"
href="https://docs.python.org/3/library/doctest.html#module-doctest"
title="(in Python v3.6)">
<code class="xref py py-mod docutils literal">
<span class="pre">doctest</span>
</code>
</a>
This might come a bit late but I have found a solution that works very neatly for me and it is not among the answers already given.
In my case, I create a file with all the links used in my project, save it as /include/links.rst, and looking something like:
.. _PEP8: https://www.python.org/dev/peps/pep-0008/
.. _numpydoc: https://numpydoc.readthedocs.io/en/latest/format.html
.. _googledoc: https://google.github.io/styleguide/pyguide.html
Then there are the files a.rst and b.rst looking like:
.. include:: /include/links.rst
File A.rst
##########
Click `here <PEP8_>`_ to see the PEP8 coding style
Alternatively, visit either:
- `Numpy Style <numpydoc_>`_
- `Google Style <googledoc_>`_
and
.. include:: /include/links.rst
File B.rst
##########
You can visit `Python's PEP8 Style Guide <PEP8_>`_
For docstrings, you can use either `Numpy's <numpydoc_>`_ or `Google's <googledoc_>`_
respectively.
The produced output for both cases is:
and
respectively.
Moreover, I would like to emphasize the fact of which I was actually really struggling to achieve, to use different names (displayed text) for the same link at different locations and which I have achieved with the double _, one inside the <..._> and another outside.
This is another solution: it is a bit hacky and a little bit different respect to the officially supported way to share external links.
First complete the Setup then:
in conf.py add the commonlinks entry in extensions
in conf.py configure the map of common links:
For example:
extensions = [
...,
'sphinx.ext.commonlinks'
]
commonlinks = {
'issues': 'https://github.com/sphinx-doc/sphinx/issues',
'github': 'https://github.com'
}
Then in .rst files you can do these:
The :github:`_url_` url is aliased to :github:`GitHub` and also to :github:`this`
Setup
All that is needed is to copy into sphinx/ext directory the file commonlinks.py:
# -*- coding: utf-8 -*-
"""
sphinx.ext.commonlinks
~~~~~~~~~~~~~~~~~~~~~~
Extension to save typing and prevent hard-coding of common URLs in the reST
files.
This adds a new config value called ``commonlinks`` that is created like this::
commonlinks = {'exmpl': 'http://example.com/mypage.html', ...}
Now you can use e.g. :exmpl:`foo` in your documents. This will create a
link to ``http://example.com/mypage.html``. The link caption depends on the
role content:
- If it is ``_url_``, the caption will be the full URL.
- If it is a string, the caption will be the role content.
"""
from six import iteritems
from docutils import nodes, utils
import sphinx
from sphinx.util.nodes import split_explicit_title
def make_link_role(base_url):
def role(typ, rawtext, text, lineno, inliner, options={}, content=[]):
text = utils.unescape(text)
if text == '_url_':
title = base_url
else:
title = text
pnode = nodes.reference(title, title, internal=False, refuri=base_url)
return [pnode], []
return role
def setup_link_roles(app):
for name, base_url in iteritems(app.config.commonlinks):
app.add_role(name, make_link_role(base_url))
def setup(app):
app.add_config_value('commonlinks', {}, 'env')
app.connect('builder-inited', setup_link_roles)
return {'version': sphinx.__display_version__, 'parallel_read_safe': True}
To locate the sphinx installation directory one way is:
$ python 3
> import sphinx
> sphinx
<module 'sphinx' from '/usr/local/lib/python3.5/dist-packages/sphinx/__init__.py'>
then:
% cp commonlinks.py /usr/local/lib/python3.5/dist-packages/sphinx/ext

Ruby - CSV works while SmarteCSV doesn't

I want to open a csv file using SmarterCSV.process
market_csv = SmarterCSV.process(market)
p "just read #{market_csv}"
The problem is that the data is not read and this prints:
[]
However, if I attempt the same thing with the default CSV library implementation the content of the file is read(the following print statement prints the file).
CSV.foreach(market) do |row|
p row
end
The content of the file I was reading is of the form:
Date,Close
03/06/15,0.1634
02/06/15,0.1637
01/06/15,0.1638
31/05/15,0.1638
The problem could come from the line separator, the file is not exactly the same if you're using windows or unix system ("\r\n" or "\r"). Try to identify and specify the character in the SmarterCSV.process like this:
market_csv = SmarterCSV.process(market, row_sep: "\r")
p "just read #{market_csv}"
or like this:
market_csv = SmarterCSV.process(market, row_sep: :auto)
p "just read #{market_csv}"

Resources