I'm using HtmlAgilityPack to sanitize user entered rich text and strip any harmful/unwanted text. Problem occurs though when a simple text is also treated as html node
If I enter
a<b, c>d
and try to sanitize it, the output generated is
a<b, c="">d</b,>
The code I used was
HtmlDocument doc = new HthmlDocument();
doc.LoadHtml(value);
// Sanitizing Logic
var result = doc.DocumentNode.WriteTo();
I tried to set different parameters on HtmlDocument ('OptionCheckSyntax', 'OptionAutoCloseOnEnd', 'OptionWriteEmptyNodes') to not have the text be treated as a node but nothing worked. Is this is a known issue or any workaround possible?
IMO, there's no way you can tell HAP to not treat every '<' as start of new html node. But you can check if your html is a validate html or not by using
string html = "your-html";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
if (doc.ParseErrors.Count() > 0)
{
//here you can ignore or do whatever you want
}
Is there a way to create a link in Markdown that opens in a new window? If not, what syntax do you recommend to do this? I'll add it to the markdown compiler I use. I think it should be an option.
As far as the Markdown syntax is concerned, if you want to get that detailed, you'll just have to use HTML.
Hello, world!
Most Markdown engines I've seen allow plain old HTML, just for situations like this where a generic text markup system just won't cut it. (The StackOverflow engine, for example.) They then run the entire output through an HTML whitelist filter, regardless, since even a Markdown-only document can easily contain XSS attacks. As such, if you or your users want to create _blank links, then they probably still can.
If that's a feature you're going to be using often, it might make sense to create your own syntax, but it's generally not a vital feature. If I want to launch that link in a new window, I'll ctrl-click it myself, thanks.
Kramdown supports it. It's compatible with standard Markdown syntax, but has many extensions, too. You would use it like this:
[link](url){:target="_blank"}
I don't think there is a markdown feature, although there may be other options available if you want to open links which point outside your own site automatically with JavaScript.
Array.from(javascript.links)
.filter(link => link.hostname != window.location.hostname)
.forEach(link => link.target = '_blank');
jsFiddle.
If you're using jQuery:
$(document.links).filter(function() {
return this.hostname != window.location.hostname;
}).attr('target', '_blank');
jsFiddle.
With Markdown v2.5.2, you can use this:
[link](URL){:target="_blank"}
So, it isn't quite true that you cannot add link attributes to a Markdown URL. To add attributes, check with the underlying markdown parser being used and what their extensions are.
In particular, pandoc has an extension to enable link_attributes, which allow markup in the link. e.g.
[Hello, world!](http://example.com/){target="_blank"}
For those coming from R (e.g. using rmarkdown, bookdown, blogdown and so on), this is the syntax you want.
For those not using R, you may need to enable the extension in the call to pandoc with +link_attributes
Note: This is different than the kramdown parser's support, which is one the accepted answers above. In particular, note that kramdown differs from pandoc since it requires a colon -- : -- at the start of the curly brackets -- {}, e.g.
[link](http://example.com){:hreflang="de"}
In particular:
# Pandoc
{ attribute1="value1" attribute2="value2"}
# Kramdown
{: attribute1="value1" attribute2="value2"}
^
^ Colon
One global solution is to put <base target="_blank">
into your page's <head> element. That effectively adds a default target to every anchor element. I use markdown to create content on my Wordpress-based web site, and my theme customizer will let me inject that code into the top of every page. If your theme doesn't do that, there's a plug-in
Not a direct answer, but may help some people ending up here.
If you are using GatsbyJS there is a plugin that automatically adds target="_blank" to external links in your markdown.
It's called gatsby-remark-external-links and is used like so:
yarn add gatsby-remark-external-links
plugins: [
{
resolve: `gatsby-transformer-remark`,
options: {
plugins: [{
resolve: "gatsby-remark-external-links",
options: {
target: "_blank",
rel: "noopener noreferrer"
}
}]
}
},
It also takes care of the rel="noopener noreferrer".
Reference the docs if you need more options.
For ghost markdown use:
[Google](https://google.com" target="_blank)
Found it here:
https://cmatskas.com/open-external-links-in-a-new-window-ghost/
I'm using Grav CMS and this works perfectly:
Body/Content:
Some text[1]
Body/Reference:
[1]: http://somelink.com/?target=_blank
Just make sure that the target attribute is passed first, if there are additional attributes in the link, copy/paste them to the end of the reference URL.
Also work as direct link:
[Go to this page](http://somelink.com/?target=_blank)
You can do this via native javascript code like so:
var pattern = /a href=/g;
var sanitizedMarkDownText = rawMarkDownText.replace(pattern,"a target='_blank' href=");
JSFiddle Code
In my project I'm doing this and it works fine:
[Link](https://example.org/ "title" target="_blank")
Link
But not all parsers let you do that.
There's no easy way to do it, and like #alex has noted you'll need to use JavaScript. His answer is the best solution but in order to optimize it, you might want to filter only to the post-content links.
<script>
var links = document.querySelectorAll( '.post-content a' );
for (var i = 0, length = links.length; i < length; i++) {
if (links[i].hostname != window.location.hostname) {
links[i].target = '_blank';
}
}
</script>
The code is compatible with IE8+ and you can add it to the bottom of your page. Note that you'll need to change the ".post-content a" to the class that you're using for your posts.
As seen here: http://blog.hubii.com/target-_blank-for-links-on-ghost/
If someone is looking for a global rmarkdown (pandoc) solution.
Using Pandoc Lua Filter
You could write your own Pandoc Lua Filter which adds target="_blank" to all links:
Write a Pandoc Lua Filter, name it for example links.lua
function Link(element)
if
string.sub(element.target, 1, 1) ~= "#"
then
element.attributes.target = "_blank"
end
return element
end
Then update your _output.yml
bookdown::gitbook:
pandoc_args:
- --lua-filter=links.lua
Inject <base target="_blank"> in Header
An alternative solution would be to inject <base target="_blank"> in the HTML head section using the includes option:
Create a new HTML file, name it for example links.html
<base target="_blank">
Then update your _output.yml
bookdown::gitbook:
includes:
in_header: links.html
Note: This solution may also open new tabs for hash (#) pointers/URLs. I have not tested this solution with such URLs.
In Laravel I solved it this way:
$post->text= Str::replace('<a ', '<a target="_blank"', $post->text);
Not works for a specific link. Edit all links in the Markdown text. (In my case it's fine)
I ran into this problem when trying to implement markdown using PHP.
Since the user generated links created with markdown need to open in a new tab but site links need to stay in tab I changed markdown to only generate links that open in a new tab. So not all links on the page link out, just the ones that use markdown.
In markdown I changed all the link output to be <a target='_blank' href="..."> which was easy enough using find/replace.
I do not agree that it's a better user experience to stay within one browser tab. If you want people to stay on your site, or come back to finish reading that article, send them off in a new tab.
Building on #davidmorrow's answer, throw this javascript into your site and turn just external links into links with target=_blank:
<script type="text/javascript" charset="utf-8">
// Creating custom :external selector
$.expr[':'].external = function(obj){
return !obj.href.match(/^mailto\:/)
&& (obj.hostname != location.hostname);
};
$(function(){
// Add 'external' CSS class to all external links
$('a:external').addClass('external');
// turn target into target=_blank for elements w external class
$(".external").attr('target','_blank');
})
</script>
You can add any attributes using {[attr]="[prop]"}
For example [Google] (http://www.google.com){target="_blank"}
For completed alex answered (Dec 13 '10)
A more smart injection target could be done with this code :
/*
* For all links in the current page...
*/
$(document.links).filter(function() {
/*
* ...keep them without `target` already setted...
*/
return !this.target;
}).filter(function() {
/*
* ...and keep them are not on current domain...
*/
return this.hostname !== window.location.hostname ||
/*
* ...or are not a web file (.pdf, .jpg, .png, .js, .mp4, etc.).
*/
/\.(?!html?|php3?|aspx?)([a-z]{0,3}|[a-zt]{0,4})$/.test(this.pathname);
/*
* For all link kept, add the `target="_blank"` attribute.
*/
}).attr('target', '_blank');
You could change the regexp exceptions with adding more extension in (?!html?|php3?|aspx?) group construct (understand this regexp here: https://regex101.com/r/sE6gT9/3).
and for a without jQuery version, check code below:
var links = document.links;
for (var i = 0; i < links.length; i++) {
if (!links[i].target) {
if (
links[i].hostname !== window.location.hostname ||
/\.(?!html?)([a-z]{0,3}|[a-zt]{0,4})$/.test(links[i].pathname)
) {
links[i].target = '_blank';
}
}
}
Automated for external links only, using GNU sed & make
If one would like to do this systematically for all external links, CSS is no option. However, one could run the following sed command once the (X)HTML has been created from Markdown:
sed -i 's|href="http|target="_blank" href="http|g' index.html
This can be further automated by adding above sed command to a makefile. For details, see GNU make or see how I have done that on my website.
If you just want to do this in a specific link, just use the inline attribute list syntax as others have answered, or just use HTML.
If you want to do this in all generated <a> tags, depends on your Markdown compiler, maybe you need an extension of it.
I am doing this for my blog these days, which is generated by pelican, which use Python-Markdown. And I found an extension for Python-Markdown Phuker/markdown_link_attr_modifier, it works well. Note that an old extension called newtab seems not work in Python-Markdown 3.x.
For React + Markdown environment:
I created a reusable component:
export type TargetBlankLinkProps = {
label?: string;
href?: string;
};
export const TargetBlankLink = ({
label = "",
href = "",
}: TargetBlankLinkProps) => (
<a href={href} target="__blank">
{label}
</a>
);
And I use it wherever I need a link that open in a new window.
For "markdown-to-jsx" with MUI v5
This seem to work for me:
import Markdown from 'markdown-to-jsx';
...
const MarkdownLink = ({ children, ...props }) => (
<Link {...props}>{children}</Link>
);
...
<Markdown
options={{
forceBlock: true,
overrides: {
a: {
component: MarkdownLink,
props: {
target: '_blank',
},
},
},
}}
>
{description}
</Markdown>
This works for me: [Page Link](your url here "(target|_blank)")
I'm facing an issue with CKEditor 4, I need to have an output without any html entity so I added config.entities = false; in my config, but some appear when
an inline tag is inserted: the space before is replaced with
text is pasted: every space is replaced with even with config.forcePasteAsPlainText = true;
You can check that on any demo by typing
test test
eg.
Do you know how I can prevent this behaviour?
Thanks!
Based on Reinmars accepted answer and the Entities plugin I created a small plugin with an HTML filter which removes redundant entities. The regular expression could be improved to suit other situations, so please edit this answer.
/*
* Remove entities which were inserted ie. when removing a space and
* immediately inputting a space.
*
* NB: We could also set config.basicEntities to false, but this is stongly
* adviced against since this also does not turn ie. < into <.
* #link http://stackoverflow.com/a/16468264/328272
*
* Based on StackOverflow answer.
* #link http://stackoverflow.com/a/14549010/328272
*/
CKEDITOR.plugins.add('removeRedundantNBSP', {
afterInit: function(editor) {
var config = editor.config,
dataProcessor = editor.dataProcessor,
htmlFilter = dataProcessor && dataProcessor.htmlFilter;
if (htmlFilter) {
htmlFilter.addRules({
text: function(text) {
return text.replace(/(\w) /g, '$1 ');
}
}, {
applyToAll: true,
excludeNestedEditable: true
});
}
}
});
These entities:
// Base HTML entities.
var htmlbase = 'nbsp,gt,lt,amp';
Are an exception. To get rid of them you can set basicEntities: false. But as docs mention this is an insecure setting. So if you only want to remove , then I should just use regexp on output data (e.g. by adding listener for #getData) or, if you want to be more precise, add your own rule to htmlFilter just like entities plugin does here.
Remove all but not <tag> </tag> with Javascript Regexp
This is especially helpful with CKEditor as it creates lines like <p> </p>, which you might want to keep.
Background: I first tried to make a one-liner Javascript using lookaround assertions. It seems you can't chain them, at least not yet. My first approach was unsuccesful:
return text.replace(/(?<!\>) (?!<\/)/gi, " ")
// Removes but not <p> </p>
// It works, but does not remove `<p> blah </p>`.
Here is my updated working one-liner code:
return text.replace(/(?<!\>\s.)( (?!<\/)|(?<!\>) <\/p>)/gi, " ")
This works as intended. You can test it here.
However, this is a shady practise as lookarounds are not fully supported by some browsers.
Read more about Assertions.
What I ended up using in my production code:
I ended up doing a bit hacky approach with multiple replace(). This should work on all browsers.
.trim() // Remove whitespaces
.replace(/\u00a0/g, " ") // Remove unicode non-breaking space
.replace(/((<\w+>)\s*( )\s*(<\/\w+>))/gi, "$2<!--BOOM-->$4") // Replace empty nbsp tags with BOOM
.replace(/ /gi, " ") // remove all
.replace(/((<\w+>)\s*(<!--BOOM-->)\s*(<\/\w+>))/gi, "$2 $4") // Replace BOOM back to empty tags
If you have a better suggestion, I would be happy to hear 😊.
I needed to change the regular expression Imeus sent, in my case, I use TYPO3 and needed to edit the backend editor. This one didn't work. Maybe it can help another one that has the same problem :)
return text.replace(/ /g, ' ');
If i have the & symbol in some field (from a db, cannot be changed), and i want to display this via freemarker... but have the display (from freemarker) read &, what is the way to do so?
To reiterate, I cannot change the value before hand (or at least, I don't want to), i'd like freemarker to "unmark" &.
To double re-iterate, this is a value that is being placed with a lot of other xml. The value itself is displayed on its own, surrouded by tags... so something like
<someTag>${wheeeMyValueWithAnAmpersand}<someTag>
As a result, i don't want all ampersands escaped, or the xml will look funny... just that one in the interpolation.
Oh goodness.
I see the problem: the code was written like this:
<#escape x as x?xml>
<#import "small.ftl" as my>
<#my.macro1/>
</#escape>
and at which i'd assumed that the excape would excape all the calls within it - it is certainly what the documentation sort of implies
http://freemarker.org/docs/ref_directive_escape.html
<#assign x = "<test>"> m1>
m1: ${x}
</#macro>
<#escape x as x?html>
<#macro m2>m2: ${x}</#macro>
${x}
<#m1/>
</#escape>
${x}
<#m2/>
the output will be:
<test>
m1: <test>
<test>
m2: <test>
However it appears that when you import the file, then this isn't the case, and the escape... escapes!
SOLUTION:
http://watchitlater.com/blog/2011/10/default-html-escape-using-freemarker/
the above link details how to solve the problem. In effect, it comes down to loading a different FreemakerLoader, one that wraps all templates with an escape tag.
class SomeCoolClass implements TemplateLoader {
//other functions here
#Override
public Reader getReader(Object templateSource, String encoding) throws IOException {
Reader reader = delegate.getReader(templateSource, encoding);
try {
String templateText = IOUtils.toString(reader);
return new StringReader(ESCAPE_PREFIX + templateText + ESCAPE_SUFFIX);
} finally {
IOUtils.closeQuietly(reader);
}
}
which is a snippet from the link above. You create the class with the existing templateLoader, and just defer all the required methods to that.
Starting from FreeMarker 2.3.24 no TemplateLoader "hack" is needed anymore. There's a setting called output_format, which specifies if and what escaping is needed. This can be configured both globally, and/or per-template-name-pattern utilizing the template_configurations setting. The recommend way of doing this is even simpler (from the manual):
[...] if the
recognize_standard_file_extensions setting is true (which is the
default with the incompatible_improvements setting set to 2.3.24 or
higher), templates whose source name ends with ".ftlh" gets "HTML"
output format, and those with ".ftlx" get "XML" output format
Using the HTMLAgilityPack to write out a new image node, it seems to remove the closing tag of an image, e.g. should be but when you check outer html, has .
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);
This breaks xhtml.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1:Here is how to fix an HTML Agilty Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{ HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;}
else
{ HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);}
replace "img" for any other tag to fix them as well (input, select, and option come up frequently). Repeat as needed. Keep in mind that this will produce rather than , because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated html/xml as a string (EG via a web service), the simple solution is to use fake tags that the agility pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string and do a simple replace for each tag prefixed with some string (in this case "Fix_", and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
HtmlNode tagReplacement = null;
foreach(var tag in doc.DocumentNode.SelectNodes("//"+tagName))
{
tagReplacement = HtmlTextNode.CreateNode("<fix_"+tagName+"></fix_"+tagName+">");
foreach(var attr in tag.Attributes)
{
tagReplacement.SetAttributeValue(attr.Name, attr.Value);
}
if(hasInnerText)//for option tags and other non-empty nodes, the next (text) node will be its inner HTML
{
tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
tag.NextSibling.Remove();
}
tag.ParentNode.ReplaceChild(tagReplacement, tag);
}
}
As a note, if I were a betting man I would guess that MikeBridge's answer above inadvertently identifies the source of this bug in the pack - something is causing the closed and empty flags to be mutually exclusive
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as the HAP codeplex discussion here: This essentially sets the empty flag option listed in Mike Bridge's answer above permanently everywhere.
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce this, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
Outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created a issue in CodePlex for this.