How can I mark all i18n changes as "fuzzy"? - internationalization

I have the following scenario:
A fellow writer is making "English only" changes to the .rst-files of a Sphinx project.
After running make gettext && sphinx-intl update [...] && sphinx-intl build, some of the translations of changed msgids are marked as fuzzy, while others are now empty.
I am searching for a way to mark all changed translations as fuzzy, for example through adjusting the minimum similarity requirement for fuzzy entries.

Related

next release of Stanza

I'm interested in the Stanza constituency parser for Italian.
In https://stanfordnlp.github.io/stanza/constituency.html it is said that a new release with updated models (including an Italian model trained on the Turin treebank) should have been available in mid-November.
Any idea about when the next release of Stanza will appear?
Thanks
alberto
Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.
pip install git+https://github.com/stanfordnlp/stanza.git@704d90df2418ee199d83c92c16de180aacccf5c0
import stanza
stanza.download("it")
It's trained on the Turin treebank, which has about 4000 trees. If you download the BERT version of the model, it gets over 91 F1 on the Evalita test set (but has a length limit of about 200 words per sentence).
We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...
Edit: there are also some scripts included for converting the publicly available Turin trees into brackets. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't work too well for a task like parsing raw text.
It is still very much a live task ... either December or January, I would say.
p.s. This isn't really a great SO question....

Inconsistent line spacing in RestructuredText document

I'm building RST files for my company's documentation. One irritating thing is that enumerated lists don't seem to have any consistency in terms of line spacing.
Is there a simple way to solve this?
Robert
It's a well-known problem of docutils, the library on which Sphinx is built.
From the Sphinx issue tracker on GitHub:
tk0miya wrote:
In my short investigation:
The behavior comes from docutils (base library of Sphinx).
In docutils.writers.html4css1.HTMLTranslator, docutils generates <p> tag if list includes any items excepting paragraphs and nested lists.
To fix this, set self.compact_simple in visit_list_item instead of visit_bullet_list and visit_enumerated_list.
But we have to know why docutils check whole of list.
Source: sphinx-doc/sphinx #2258 - Nested field lists inside list items cause unwanted space in HTML output
See related issues:
https://github.com/rtfd/sphinx_rtd_theme/issues/119
I'm unsure how to apply Paebbels' answer; however, I was able to get rid of the <p> tags by switching to the html4 writer, which I did by adding this line to my conf.py:
html4_writer = True
This obviously changes it to the html4 writer, so you'll need to determine whether this is acceptable or not.

reStructuredText and glossary terms translation

I'd like to know how I can translate (as in i18n) the terms in a glossary. I use Sphinx 1.1.3.
Let's say I have:
.. glossary::

   term
      definition
After I run make gettext I get the .po files, but I can only translate the definitions, not the terms. I searched the documentation thoroughly but couldn't find any hints. If translation of terms is somehow possible, how can I automatically sort them alphabetically in the target language?
It seems that this feature will be available in Sphinx 1.2
The question about sorting the translated glossary still remains. :sorted: does not work.

Can I automatically update msgids in gettext's .po files for trivial text changes?

With gettext, the original (usually English) text of messages serves as
the message key ("msgid") for the translations. This means that every time the
original text changes, the msgid must be updated in all the .po files.
For real changes of the text, this is obviously unavoidable, as the
translator must update the translation.
However, if the change of the original does not change its meaning,
re-translation is superfluous (e.g. change in punctuation, whitespace
changes, or correction of a spelling mistake).
Is there a way to update the .po files automatically in that case?
I tried to use xgettext & msgmerge (with fuzzy matching turned on), but
fuzzy matching sometimes fails, plus this produces lots of ugly
"#,fuzzy" flags.
Note: There is a similar question:
How to efficiently work with gettext PO files when making small edits to large text values
However, it's about large strings, thus about a more specific problem.
One way to avoid the problem is to leave the msgids alone, have a .po file for the original language and make the fix inside that.
It always strikes me as more of a workaround than a proper fix, though. In the next iteration (where there will definitely be more msgid changes), the msgid is changed, and either the translators pick it up in their usual update or each language is updated by hand.
I've had exactly this issue when doing minor changes to a django project. What I do is the following:
1. Change the message in the code.
2. Run find and replace on all translation files ("django.po"), replacing the old message (msgid) with the new one.
3. Run django-admin makemessages.
If I have done things right, the last step is superfluous (i.e., the change has already been made for gettext). Django uses the gettext utilities, so it shouldn't matter how you make your message files.
I find and replace like so:
find . -name "*.po" -print | xargs sed -i 's/oldmessageid/newmessageid/g'
Courtesy of http://rushi.vishavadia.com/blog/find-replace-across-multiple-files-in-linux
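The sed one-liner above treats the old message as a regular expression and also rewrites any msgstr that happens to contain it. A minimal Python sketch of the same find-and-replace, limited to msgid lines and using literal text, might look like the following. The function name replace_msgid is hypothetical, and the sketch assumes single-line msgids and UTF-8 files; it is not part of Django or gettext.

```python
from pathlib import Path


def replace_msgid(root, old, new):
    """Replace the literal text `old` with `new` on msgid lines only,
    in every .po file below `root`. Returns the files that changed.

    Hypothetical helper: assumes single-line msgids and UTF-8 encoding.
    """
    changed = []
    for po_path in Path(root).rglob("*.po"):
        lines = po_path.read_text(encoding="utf-8").splitlines(keepends=True)
        # Leave msgstr lines, comments, and headers exactly as they are.
        updated = [
            line.replace(old, new) if line.startswith("msgid ") else line
            for line in lines
        ]
        if updated != lines:
            po_path.write_text("".join(updated), encoding="utf-8")
            changed.append(po_path)
    return changed
```

Running django-admin makemessages afterwards should then report no differences for the touched entries, which is a quick way to check that the replacement matched the new source text.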

How to efficiently work with gettext PO files when making small edits to large text values

Looking for tips and/or tools on how to efficiently work with gettext PO files when making small edits to large msgid values.
Example: We have lots of multi-sentence/multi-paragraph messages that are stored in our PO message catalog files. If we make a very minor change to a message, perhaps editing a single sentence or even correcting punctuation, we lose our original translation when we run the msgmerge utility.
Rather than re-translate long messages (that have already gone through an editorial approval process) from scratch, our translators return to backup copies of their PO files and manually search for the text of the last msgid/msgstr translation pair which they then diff against the current msgid values to see what has changed, followed by a copy and paste of the last translation which they then edit to reflect the updated msgid value.
That's a lot of work! Certainly there must be a better way of handling this type of workflow?
Is there a best practice way to archive and find previous translations that are no longer in a PO file? One idea that comes to mind is to store a unique msg id in the text of our messages or in the comments that precede our message and use this id to retrieve previous msgid/msgstr translation pairs for review. Or are there PO editors or online services that make this process more efficient?
Thank you,
Malcolm
I've been looking for a way to make minor changes to msgids without disturbing existing translations - for instance, typo fixes in the source text. Here's a recipe I've just worked out that doesn't involve websites:
Use msgen from GNU gettext to generate an English-to-English po file:
msgen project.pot >corrections.po
Manually edit the msgstrs in "corrections.po" to reflect the typo fixes made in the source text, so we have a mapping from uncorrected to corrected strings. (I haven't thought about how to automate this bit.)
For each "real" translation (for example ca.po): abuse poswap from the Translate Toolkit (translate-toolkit in Ubuntu) to change the msgids:
poswap -i corrections.po -t ca.po -o ca.new.po
This does seem to lose header comments and obsolete strings from GNU gettext po files, but manually fixing those up is much less work than manually tweaking msgids in each translation (and could probably easily be scripted).
(Obviously, this should only be used in exceptional circumstances, where you're absolutely sure that none of the translators need the opportunity to re-review their translations.)
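To make the idea behind the recipe concrete, here is a rough Python sketch of what the msgen/poswap combination achieves: read the old-to-corrected mapping out of the English-to-English file, then rewrite the msgids of a real translation while leaving its msgstrs alone. It is not how poswap is implemented; the names load_corrections and swap_msgids are hypothetical, and the regular expression only handles single-line msgid/msgstr pairs.

```python
import re

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def load_corrections(corrections_po):
    """Map each uncorrected msgid to its corrected text (the edited msgstr)."""
    return {
        old: new
        for old, new in ENTRY.findall(corrections_po)
        if new and new != old
    }


def swap_msgids(translation_po, corrections):
    """Rewrite msgids according to `corrections`, keeping every msgstr."""
    def repl(match):
        msgid, msgstr = match.group(1), match.group(2)
        return 'msgid "%s"\nmsgstr "%s"' % (corrections.get(msgid, msgid), msgstr)

    return ENTRY.sub(repl, translation_po)


corrections_po = 'msgid "Teh file"\nmsgstr "The file"\n\nmsgid "Save"\nmsgstr "Save"\n'
ca_po = 'msgid "Teh file"\nmsgstr "El fitxer"\n\nmsgid "Save"\nmsgstr "Desa"\n'

mapping = load_corrections(corrections_po)
ca_new_po = swap_msgids(ca_po, mapping)
```

Because the text outside matched entries is passed through untouched, header comments survive in this sketch, but multi-line strings and plural forms would need a real PO parser.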
Virtaal's translation memory support can probably help with this. If your original units are in the translation memory, the closest one is shown (with differences) as long as it is within a certain margin of change (based on Levenshtein distance). The suggestion still contains the original (unmodified) translation, but at least the original text is more easily accessible and the differences are highlighted.
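As an illustration of the translation-memory idea, the sketch below looks up the closest previous msgid for a changed one and returns its old translation. Note the substitution: it uses difflib from the Python standard library, whose similarity ratio is not Levenshtein distance, so the scores will differ from Virtaal's. The function name suggest_previous, the cutoff of 0.8, and the sample strings are assumptions made for the example.

```python
import difflib


def suggest_previous(new_msgid, memory, cutoff=0.8):
    """Return (old_msgid, old_msgstr, ratio) for the closest earlier msgid,
    or None when nothing in `memory` is similar enough.

    `memory` maps previous msgids to their translations.
    """
    matches = difflib.get_close_matches(new_msgid, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    old_msgid = matches[0]
    ratio = difflib.SequenceMatcher(None, new_msgid, old_msgid).ratio()
    return old_msgid, memory[old_msgid], ratio


memory = {
    "Click the button to save you changes.": "Klicken Sie auf die Schaltfläche, um Ihre Änderungen zu speichern.",
    "Delete the file": "Datei löschen",
}
suggestion = suggest_previous("Click the button to save your changes.", memory)
```

A translator could then diff the returned old msgid against the new one and adjust the old translation instead of starting from scratch.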
I'm not 100% sure, but Pootle might also offer a web based solution. If you need any help, ask in #pootle on FreeNode.
The more general improvement is, of course, to separate/segment the units as far as possible.
