Treat markup as translatable with SoyMsgExtractor - internationalization

Is there a way to have HTML markup inside {msg} not turned into placeholders when using SoyMsgExtractor?
Say I have some documentation:
{msg desc="Document X, can contain HTML markup"}
<p>Foo bar baz</p>
<p>Blah <b>blah</b> blah</p>
{/msg}
And I want the translatable message to contain the HTML markup rather than having it turned into placeholders. I.e. I'd like the generated XLIFF to read:
<source><p>Foo bar baz</p><p>Blah <b>blah</b> blah</p></source>
or
<source><![CDATA[<p>Foo bar baz</p><p>Blah <b>blah</b> blah</p>]]></source>
rather than
<source><x id="START_PARAGRAPH"/>Foo bar baz<x id="END_PARAGRAPH"/><x id="START_PARAGRAPH"/>Blah <x id="START_BOLD"/>blah<x id="END_BOLD"/> blah<x id="END_PARAGRAPH"/></source>
That way, translators can really feel free to split or merge paragraphs, add bold or italics, etc.
Additionally, it'd probably make the msg processing easier and possibly faster.

Related

Are there cases of editing HTML output by Aspose.Words with CKEditor?

I am having a problem where text edited in CKEditor is not output to Word, because it inherits the "-aw-import:ignore" attribute.
A tag with this attribute carries over attributes of the original Word document when converting from HTML to Word, and, like a meta tag, it is not output to Word.
If text entered in CKEditor inherits this attribute, it is mistakenly not output to Word.
Aspose.Words writes this "-aw-import:ignore" only when it needs to make certain elements visible in HTML that would otherwise be collapsed and hidden by web browsers e.g. empty paragraphs, space sequences, etc.
Currently we mark only the following elements with “-aw-import:ignore”:
Sequences of spaces and non-breaking spaces that are used to simulate padding on native list item (<li>) elements.
Non-breaking spaces that are used to prevent empty paragraphs from collapsing.
However, note that this list is not fixed and we may add more cases to it in the future.
Also, please note that Aspose.Words writes &#xa0; instead of &nbsp;, because &nbsp; is not defined in XML. And by default Aspose.Words generates XHTML documents (i.e. HTML documents that comply with XML rules).
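For illustration, the markup for an empty paragraph kept visible this way can look roughly like the following (a sketch of the pattern, not verbatim Aspose.Words output):
<!-- the &#xa0; keeps the empty paragraph from collapsing in a browser;
     the -aw-import:ignore style tells Aspose.Words to drop it on import -->
<p style="margin:0pt"><span style="-aw-import:ignore">&#xa0;</span></p>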
I work with Aspose as Developer Evangelist.
Please find below a list of custom styles that Aspose.Words uses to save extra information in output HTML; this information is usually used for the Aspose.Words-HTML-Aspose.Words round trip. We will add descriptions of these entities to the documentation as soon as possible.
-aw-comment-author
-aw-comment-datetime
-aw-comment-initial
-aw-comment-start
-aw-comment-end
-aw-footnote-type
-aw-footnote-numberstyle
-aw-footnote-startnumber
-aw-footnote-isauto
-aw-headerfooter-type
-aw-bookmark-start
-aw-bookmark-end
-aw-different-first-page
-aw-tabstop-align
-aw-tabstop-pos
-aw-tabstop-leader
-aw-field-code
-aw-wrap-type
-aw-left-pos
-aw-top-pos
-aw-rel-hpos
-aw-rel-vpos
-aw-revision-author
-aw-revision-datetime

Allow copy / paste in a text_area field form but remove formatting

I have a text_area field in a form which allows some text formatting through a very simple WYSIWYG (bold / underline / bullet points). This was aimed at having consistent formatting in users' profile descriptions.
<%= l.text_area :access, value: "#{t('.access_placeholder_html')}" %>
Nevertheless, some users fill the text_area by copying and pasting directly from their website, and their specific formatting (hypertext links, font size, etc.) is then reflected on my website, which makes it a bit messy.
How can I solve this problem? Ideally, when the form is saved, it would get rid of all the HTML that is not allowed, rather than disallowing copy/paste altogether. Is this possible? I was wondering if I should use Sanitize, and if so, how? (Sorry, I'm new to coding, as you've probably gathered.)
You didn't say which version of Rails, but you could use #sanitize from the ActionView::Helpers::SanitizeHelper module to strip the HTML formatting. It scrubs HTML from text with a scrubber; the default scrubber allows optional whitelisting of tags and attributes. You can even build your own custom scrubber if you need more control over what is output. The module also contains #strip_tags and #strip_links, two very simple scrubbers that remove, respectively, all HTML tags and all HTML link tags (leaving the link text).
Note that you can wind up with malformed text if the user's input wasn't valid HTML.
Quick examples from the docs:
# Remove all HTML tags except <strong>, <em> and <a>, and all
# attributes other than href, from @text
nomarkup_text = sanitize @text, tags: %w(strong em a), attributes: %w(href)
# Remove all HTML markup
strip_tags("<b>Bold</b> no more! <a href='more.html'>See more here</a>...")
# returns the string "Bold no more! See more here..."
# Remove just the link markup
strip_links('Please e-mail me at me@email.com.')
# returns the string "Please e-mail me at me@email.com."
More detail at the API page for SanitizeHelper.
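To strip the disallowed markup at save time, as the question asks, one option is a model callback. A minimal sketch, assuming a User model with an access attribute (the model name and the tag whitelist are examples, not taken from the question's app):
class User < ApplicationRecord
  # Scrub the pasted HTML just before saving, keeping only the tags the
  # simple WYSIWYG (bold / underline / bullet points) should produce.
  before_save :scrub_access

  private

  def scrub_access
    self.access = ActionController::Base.helpers.sanitize(
      access, tags: %w(b strong u em ul li), attributes: []
    )
  end
end
With this in place it doesn't matter what users paste into the text_area: anything outside the whitelist is removed when the record is saved.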

In ckeditor how can we avoid extra non-breaking spaces in paragraph tags?

When editing in CKEditor I very frequently end up with extra clusters of <p>&nbsp;</p> tags. Not only do they add extra unneeded line breaks, they often show up on the resulting page with a broken-looking character in them.
Is there a configuration setting or something to tell the editor not to add these extra non-breaking spaces in paragraph tags?
Thanks,
doug
The paragraphs with &nbsp; represent empty lines in the editor. They make the content look exactly the same inside the editor and outside it (when displayed on the target page). If they cause you a problem, then it's not the editor but your backend, so I'd rather recommend checking that.
Surprisingly though, there's an option to disable filling empty blocks: config.fillEmptyBlocks.
But it's really not the answer.
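For reference, disabling it looks like this in CKEditor 4 (a sketch; this goes in your config.js):
CKEDITOR.editorConfig = function(config) {
  // Stop the editor from padding empty blocks with &nbsp;
  config.fillEmptyBlocks = false;
};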

CKEDITOR How to find and wrap text in span

I am writing a CKEDITOR plugin that needs to wrap certain pieces of text in a tag. From a webservice, I have an array of items that need to be wrapped. The array is just the plain text strings. Such as:
"[best buy", "horrible migraine", "eat cake"]
I need to find the instances of this text in the editor and wrap them in a span tag.
This is further complicated because the text may be marked up. So the HTML for "best buy" might be
"<strong>best</strong> buy"
but the text returned from the web service is stripped of any markup.
I started trying to use a CKEDITOR.htmlParser() object, and that seems like it is moderately successful. I am able to catch the parser.onText event and check if the text contains anything in my array.
But then I cannot modify that text. Modifications are not persisted back to the source html. So I think using the htmlParser() is a dead-end.
What is the best way to accomplish this task?
Oh, and as a bonus, I also do not want to lose my user's current cursor position when the changes are displayed.
Here is what I wound up doing and it seems to be working so far.
I created a text filter rule that searches through my array of items for any item that is contained (or partially contained) in the text. If so, it wraps the element in my span.
A drawback here is that I wind up with two spans for items with markup, but in my use case this is tolerable.
Then I set the results using:
editor.document.getBody().setHtml(results);
Because of this, I also have to strip this markup back out when this text gets read. I do this using an elements filter on editor.dataProcessor.htmlFilter.
This seems to be working well for my (so far limited) test cases.
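A condensed sketch of that approach, assuming CKEditor 4 (the phrases array and the highlight class name are placeholders for whatever the web service returns):
var phrases = ['best buy', 'horrible migraine', 'eat cake'];

// Wrap matching text while content is being loaded into the editor.
editor.dataProcessor.dataFilter.addRules({
  text: function(text) {
    phrases.forEach(function(phrase) {
      // Escape regex metacharacters before building the pattern.
      var pattern = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      text = text.replace(new RegExp(pattern, 'gi'), '<span class="highlight">$&</span>');
    });
    return text;
  }
});

// Strip the spans back out on output so the saved data stays clean.
editor.dataProcessor.htmlFilter.addRules({
  elements: {
    span: function(element) {
      if (element.attributes['class'] === 'highlight')
        delete element.name; // removes the tag but keeps its children
    }
  }
});
As noted above, a phrase that spans markup boundaries (like "<strong>best</strong> buy") still ends up in two separate spans, because each text node is filtered on its own.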

Map RSS entries to HTML body w. non-exact search

How would you solve this problem?
You're scraping the HTML of blogs. Some of a blog's HTML is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell which text in the HTML belongs to which post (i.e. to which permalink), if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether! However, RSS very often contains only very short excerpts or strips away links that you might be interested in. You want to essentially defeat the excerptedness of the RSS by using the HTML and RSS of the same page together.
An RSS entry looks like:
title
excerpt of post body
permalink
A blog post in HTML looks like:
title (surrounded by permalink, maybe)
...
permalink, maybe
...
post body
...
permalink, maybe
So the HTML page contains the same fields but the placement of the permalink is not known in advance, and the fields will be separated by some noise text that is mostly HTML and white space but also could contain some additional metadata such as "posted by Johnny" or the date or something like that. The text may also be represented slightly different in HTML vs. RSS, as described below.
Additional rules/caveats:
Titles may not be unique. This happens more often than you might think. Examples I've seen: "Monday roundup", "TGIF", etc..
Titles may even be left blank.
Excerpts in RSS are also optional, but assume there must be at least either a non-blank excerpt or a non-blank title
The RSS excerpt may contain the full post content but more likely contains a short excerpt of the start of the post body
Assume that permalinks must be unique and must be the same in both HTML and RSS.
The title and the excerpt and post body may be formatted slightly differently in RSS and in HTML. For example:
RSS may have HTML inside the title or body stripped, while on the HTML page more HTML could be added (such as wrapping the first letter of the post body in some element), or it could be formatted slightly differently
Text may be encoded slightly differently, such as being utf8 in RSS while non-ascii characters in HTML are always encoded using ampersand encoding. However, assume that this is English text where non-ascii characters are rare.
There could be badly encoded Windows-1252 horribleness. This happens a lot for symbol characters like curly quotes. However, it is safe to assume that most of the text is ascii.
There could be case-folding in either direction, especially in the title. So, they could all-uppercase the title in the HTML page but not in RSS.
The number of entries in the RSS feed and the HTML page is not assumed to be the same. Either could have more or fewer older entries. We can only expect to match those posts that appear in both.
RSS could be lagged. There may be a new entry in the HTML page that does not appear in the RSS feed yet. This can happen if the RSS is syndicated through Feedburner. Again, we can only expect to resolve those posts that appear in both RSS and HTML.
The body of a post can be very short or very long.
100% accuracy is not a constraint. However, the more accurate the better.
Well, what would you do?
I would create a scraper for each of the major blogging engines. Start with the main text for a single post per page.
If you're lucky, the engine will provide reasonable XHTML, so you can come up with a number of useful XPath expressions to get the node which corresponds to the article. If not, then I'm afraid it's TagSoup or Tidy to coerce it into well-formed XML.
From there, you can look for the metadata and the full text. This should safely remove the headers/footers/sidebars/widgets/ads, though may leave embedded objects etc.
It should also be fairly easy (TM) to segment the page into article metadata, text, comments, etc., and put it all into a fairly sensible RSS/Atom item.
This would be the basis of taking an RSS feed (non-full text) and turning it into a full text one (by following the permalinks given in the official RSS).
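A minimal sketch of that per-engine extraction step, in Ruby with Nokogiri (the XPath expressions and class names are hypothetical; real ones would come from inspecting each engine's templates):
require 'nokogiri'
require 'open-uri'

# One article-body XPath per supported blog engine.
ARTICLE_XPATHS = {
  wordpress: '//div[contains(@class, "entry-content")]',
  blogger:   '//div[contains(@class, "post-body")]'
}

def full_text(permalink, engine)
  doc  = Nokogiri::HTML(URI.open(permalink))
  node = doc.at_xpath(ARTICLE_XPATHS[engine])
  node && node.text.strip
end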
Once you have a scraper for a blog engine, you can start looking at writing a detector - something that will be the basis of the "given a page, what blog engine was it published with".
With enough scrapers and detectors, it should be possible to point at a given RSS/Atom feed and convert it into a full text feed.
However, this approach has a number of issues:
while you may be able to target the big five blog engines, there may be some blogs you simply have to support that aren't covered by them: e.g. there are 61 engines listed on Wikipedia, and people who write their own blogging engines each need their own scraper.
each time a blog engine changes versions, you need to change your detectors and scrapers. More accurately, you need to add a new scraper and detector. The detectors have to become increasingly fussy to distinguish between one version of the same engine and the next (e.g. every time Slashcode changes, it usually changes the HTML, but different sites use different versions of Slash).
I'm trying to think of a decent fallback, but I'll edit once I have.
RSS is actually quite simple to parse with XPath and any XML parser (or regexes, but that's not recommended): you go through the <item> tags, looking for <title>, <link> and <description>.
You can then store them as different fields in a database, or directly merge them into HTML. In case the <description> is missing, you could scrape the link (one way would be to compare multiple pages to weed out the layout parts of the HTML).
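A minimal sketch of that loop in Ruby, using REXML's XPath support (the feed file name is a placeholder):
require 'rexml/document'

doc = REXML::Document.new(File.read('feed.xml'))
REXML::XPath.each(doc, '//item') do |item|
  title       = item.elements['title']&.text
  link        = item.elements['link']&.text
  description = item.elements['description']&.text # optional in RSS
  # Store the fields, or fall back to scraping `link` when description is nil.
end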
