Is there a Spinx option to prevent parsing of URL's globally? - python-sphinx

related to: How do I prevent sphinx from making a url a hyperlink?
In the question above we learn how to escape individual URL's in reStructured Text to prevent Sphinx from turning them into hyperlinks when converting to HTML. However I have a lot of URL's and I would like to keep my .rst files as clean as possible. It is an API documentation, so adding backslashes or quotes makes it less readable. Is there a config option to prevent Sphinx from parsing URL's altogether?

Unfortunately, I don't think there's any easy way. Implicitly detecting URIs happens at the reST-parsing layer:
https://github.com/qsnake/docutils/blob/68af50cccd2c8bb88264bffad44faa8e47e5d7dc/docutils/parsers/rst/states.py#L627
Sphinx is a set of predefined domains & related tooling on top of docutils' reST implementation, so this is lower-level than the things it provides config options for.
There might be some way of getting the HTML writer to not emit <a> tags out the output side of things, but my guess is that even if it's possible, it's likely to be pretty involved.

Related

HTML/XSS escape on input vs output

From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side html rendering, a JavaScript web app using AJAX/JSON, and mobile app that receives JSON (and which may or may not have some webviews, which may be JavaScript apps or server-rendered html). So you have to deal with html escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why html escaping on page render is preferred
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for a HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack & Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get "Jack & Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL, and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&y=2.
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often you have another layer of encoding applied e.g. you might need base64 encoded input in some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, and remember to do, and keep track of where you have done it.
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
The original misconception
Do not confuse sanitation of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". And said presentation logic is what presentation layer deals with. And the View instances in particular. In a well written MVC, Views are full-blown objects (contrary to what RoR would try to to tell you), which, when applied in web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing original and the xss-safe "public" version is waste. If you want to store sanitized output, because it takes too much resources to sanitize it each time, then you are already pissing at the wrong tree. This is a case, when you use cache, instead of polluting the database.

ruby markdown parser with WikiWord support?

I am using git-wiki for my personal note storage. It works very well, except that WikiWords are converted to links before the markdown parsing stage, using a regular expression. This messes up scores of things, for instance links that point to outside wiki pages, or block quotes (if I am quoting something, I do not want a WikiWord to be changed into a link).
Are there ruby-based Markdown parsers that understand WikiLinks?
The best parser around is the C-based one (upskirt/sundown), whose ruby iteration is red carpet:
https://github.com/tanoku/redcarpet
It is better for performance and security reasons.
For the wiki links, pre-process them before sending your text to the markdown parser.

Should you / can you store html / erb code in a database?

I am looking to create an application effectively like a very simple blog - with another article being added every few days. In order to control the appearance of the articles, I would like to write them directly using html so picures, links etc can be put in appropriate places and formatted as required. However, storing these articles in a database would seem to make the maintenance a lot easier and provide additional searching capabilities. Is it sensible to store such html / erb code in a database in this way? If not what are the alternatives?
The standard way to do this is to use a markup library, such as redcloth, to do this. You use something similar even here on SO - it looks like markdown, if you read the help link. The content can certainly be put in a database. It can be done with html and erb, but the reason it is not often done is safety.
If you are the only one using it, it may not be an issue, but if you allow anyone else to insert data, you can open yourself up to XSS attacks with html, or even code exploits if you allowed raw erb. Markup languages exist to limit the set of markup allowed and to remove the ability for scripting attacks.
see also: Better ruby markdown interpreter?
Update: What great timing, there was a railscast released today about this: http://railscasts.com/episodes/272-markdown-with-redcarpet

Insert a hyperlink to another file (Word) into Visual Studio code file

I am currently developing some functionality that implements some complex calculations. The calculations themselves are explained and defined in Word documents.
What I would like to do is create a hyperlink in each code file that references the assocciated Word document - just as you can in Word itself. Ideally this link would be placed in or near the XML comments for each class.
The files reside on a network share and there are no permissions to worry about.
So far I have the following but it always comes up with a file not found error.
file:///\\165.195.209.3\engdisk1\My Tool\Calculations\111-07 MyToolCalcOne.docx
I've worked out the problem is due to the spaces in the folder and filenames.
My Tool
111-07 MyToolCalcOne.docx
I tried replacing the spaces with %20, thus:
file:///\\165.195.209.3\engdisk1\My%20Tool\Calculations\111-07%20MyToolCalcOne.docx
but with no success.
So the question is; what can I use in place of the spaces?
Or, is there a better way?
One way that works beautifully is to write your own URL handler. It's absolutely trivial to do, but so very powerful and useful.
A registry key can be set to make the OS execute a program of your choice when the registered URL is launched, with the URL text being passed in as a command-line argument. It just takes a few trivial lines of code to will parse the URL in any way you see fit in order to locate and launch the documentation.
The advantages of this:
You can use a much more compact and readable form, e.g. mydocs://MyToolCalcOne.docx
A simplified format means no trouble trying to encode tricky file paths
Your program can search anywhere you like for the file, making the document storage totally portable and relocatable (e.g. you could move your docs into source control or onto a website and just tweak your URL handler to locate the files)
Your URL is unique, so you can differentiate files, web URLs, and documentation URLs
You can register many URLs, so can use different ones for specs, designs, API documentation, etc.
You have complete control over how the document is presented (does it launch Word, an Internet Explorer, or a custom viewer to display the docs, for example?)
I would advise against using spaces in filenames and URLs - spaces have never worked properly under Windows, and always cause problems (or require ugliness like %20) sooner or later. The easiest and cleanest solution is simply to remove the spaces or replace them with something like underscores, dashes or periods.

Why not use HTML tags in websites' text editors?

I may need to implement this sometime in the future, but I think the trigger for the question now is mainly curiosity.
I thought of how to write a text editor to a web site I'll build soon, and saw this site's (and other's) way, so I thought - isn't it a bit too complicated? If tags should be used from the first place, why not let users use HTML tags? The only reason I can think of is HTML injection which I don't know much about, but it sounds like an easy issue to solve, isn't it?
Thank you.
Simply because not all of your users will know HTML. *bold text* is a lot more easy to understand (and read in it's raw form) than <b>bold text</b>. Especially if you get into links.
The reason we use Markdown, Textile and the rest is to provide a nice alternative that's accessible to more users.
Of course you can still provide the ability to use HTML to your users (it's in the Markdown spec) but you'll have to do a lot of checking to make sure there's nothing malicious going on - for example, blocking <script>, <iframe>, large images, javascript in the form <a href="javascript:alert("...");"> etc.
There are several reason why you should not use HTML tags in such an editor:
1) It might be less complex for the user if you introduce an own reduced tag set
2) HTML Injection: There is a big risk of dangerous HTML code getting injected.
If you really want to allow HTML code you have to be very careful.
Historically, systems like BBCode were designed to limit available formatting elements to things that would not break the layout of the site, but now, with more mature and smarter HTML parsers, it's not necessary to invent a new markup language just to bar certain un-safe HTML tags.
The current main reason I've seen is that HTML is foreign to most users, and the HTML substitutes are aimed at providing a simplified version of the formatting directives an every-day user would need.
HTML script injection is most emphatically not an easy problem to solve. HTML is a fairly complicated, non-regular language - detecting all possible vulnerabilities is a really hard problem. Many sites have tried, and failed. It's easier, from a vulnerability-prevention POV, to just prohibit HTML entirely, or allow only a small subset of tags.

Resources