TWINE game localisation - twine-game-engine

Does anyone know if it is possible to localise a Twine game? I'd like to have my interactive stories in all the Scandinavian languages. I also plan to add spoken mp3 narration in each language for non-readers at a later stage. My thought was to maybe have one complete story file per language, but that seems like a hard thing to maintain.
Does anyone have a best way of doing this?

It may be possible to localize your Twine game's default UI strings, depending on your story format. For example, if you're using the SugarCube v2 story format in Twine, then there are some SugarCube localizations here.
However, for your story text it's entirely up to you how you handle displaying it based on the user's choices. Again, assuming you're using SugarCube, you might have the user select the language at the beginning like this:
<<set $lang = "EN">>
''Select your language:'' <<listbox "$lang" autoselect>>
<<option "English" "EN" >>
<<option "русский" "RU">>
<<option "українська" "UK">>
<<option "Türkçe" "TR">>
<</listbox>>
That will give you a dropdown list of language options.
Then, in each of your passages, you would have something like this:
<<switch $lang>>
<<case "RU">>
(Russian version of passage.)
<<case "UK">>
(Ukrainian version of passage.)
<<case "TR">>
(Turkish version of passage.)
<<default>>
(English version of passage.)
<</switch>>
You could put any non-language-based code outside of that "switch" macro.
If you're using something other than SugarCube for your story format, then you'll likely need to do something similar.
Hope that helps! :-)

Related

How do you name i18n text?

When I work on web applications with my colleagues, the names of i18n text keys are chosen quite freely.
It's like each person has their own naming rule.
Take an example: we have the text "Create a new item", which is used for a link.
A names the key in the resources file CreateANewItem, which simply joins all the words together.
B prefers to name it CreateLinkText, which describes its usage in the application.
C, however, wants to use CreateItemText, which summarizes its literal meaning.
When the text is longer, or contains format placeholders for dynamic content, naming varies even more and agreement is hard to reach.
So I wonder whether there's a good naming rule or convention for i18n text in different cases: short, long, with format placeholders, prone to change, etc. Or how do you do this in your project? With such a convention, maintenance would be easier and the code more readable.
Thanks a lot.
It's not something that I have seen so far in coding conventions. I guess it matters a lot less than other issues when it comes to coding standards. That said, I don't know what platform you are using, but if it's .NET there is a very short page on naming conventions for resource identifiers here: http://msdn.microsoft.com/en-us/library/vstudio/ms229037%28v=vs.100%29.aspx
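For illustration only, here are the three styles from the question side by side as keys in a simple lookup table (a Python sketch, not tied to any framework; the key names come from the question, everything else is an assumption):
# Three naming styles for the same string, as described in the question.
MESSAGES = {
    "CreateANewItem": "Create a new item",  # A: join all the words of the text
    "CreateLinkText": "Create a new item",  # B: describe where/how it is used
    "CreateItemText": "Create a new item",  # C: summarise the literal meaning
}

def t(key):
    """Look up a UI string by its resource key."""
    return MESSAGES[key]

print(t("CreateLinkText"))  # -> "Create a new item"
Whichever style you pick, the main thing is that the whole team agrees on one and sticks to it.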

text mining/analyse user commands/questions algorithm or library

I have a financial application, and I wish to add the ability to take a user command or question typed into a textbox and then take the right action. For example, if the user writes "show the revenue in the last 10 days", it will show them the revenue. The point is that I want it to really understand the meaning of the question, so the previous statement would give the same results as "do I have any revenue in the last 10 days" or something like that - BI (something like the Wolfram|Alpha engine).
I wonder if there's any open-source library, algorithm book, or anything else I can use to learn the subject. Regarding open-source libraries, I don't mind which language they're written in.
I've read about the subject and have seen many engines and services (OpenNLP, Apache UIMA, CoreNLP, etc.) but could not figure out whether they're right for my needs.
Any answer or suggestion is welcome.
Many thanks!
The field you're talking about is usually called "natural language processing". It's hard, and an active field of research. There are various libraries which you could consider based on your preferred programming language and use case:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
I've used NLTK a little bit. This field is seriously difficult to get right, so you might want to restrict your application to a small set of verbs and nouns, so that people are using a controlled vocabulary in the first instance, and then try to extend it beyond that.
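As a hedged starting point, something like this is about as small as an NLTK-based controlled-vocabulary approach gets (the verb/noun lists and the example sentence are assumptions you would extend):
import nltk

# One-time download of the tokenizer model (uncomment on first run):
# nltk.download("punkt")  # or "punkt_tab" on newer NLTK versions

sentence = "show the revenue in the last 10 days"
tokens = [t.lower() for t in nltk.word_tokenize(sentence)]

# A controlled vocabulary: a handful of verbs and nouns the application
# understands, to be extended once the simple cases work.
KNOWN_VERBS = {"show", "list", "sum"}
KNOWN_NOUNS = {"revenue", "expenses", "profit"}

verb = next((t for t in tokens if t in KNOWN_VERBS), None)
noun = next((t for t in tokens if t in KNOWN_NOUNS), None)
print(verb, noun)  # show revenue
Once verb and noun are identified, mapping them to a query ("show" + "revenue" + a date range) is ordinary application logic rather than NLP.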

Why do people use plain English as translation placeholders?

This may be a stupid question, but here goes.
I've seen several projects using some translation library (e.g. gettext) working with plain English placeholders. So for example:
_("Please enter your name");
instead of abstract placeholders (which have always been my instinctive preference):
_("error_please_enter_name");
I have seen various recommendations on SO to use the former method, but I don't understand why. What I don't get is: what do you do if you need to change the English wording? Because if the actual text is used as the key for all existing translations, you would have to edit all the translations too and change each key. Or wouldn't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language already exists: you don't need to translate it.
Translators have better context with a real sentence than vague placeholders.
The placeholders are just the keys; it's still possible to change the original language by creating a translation for it, because when a translation doesn't exist, the placeholder is used as the translated text.
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always have the problem of homonyms. For example, 'open' can mean the state of a window but also the verb to perform the action. In other languages these homonyms may not exist. That's why you should be able to add meaning to your placeholders. The best approach is to put this meaning in your text library. If this is not possible in the platform or framework you use, it might be a good idea to define a 'development language'. This language adds meaning to the text entries, for example 'action_open' and 'state_open'. You will of course have to put extra effort into translating this language into plain English (or whichever language you develop for). I have put this philosophy into some large projects, and in the long run this saves some time (and headaches).
The best way, in my opinion, is to keep the meaning separate, so if you develop your own translation library, or the one you use supports it, you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
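To illustrate, here is a minimal sketch of such an i18n(text, meaning) wrapper (Python used purely for illustration; the dictionary layout, target language and translations are all made up):
# Translations keyed by (meaning, source text) so homonyms like "Open"
# can be translated differently per context.
TRANSLATIONS = {
    ("action", "Open"): "Ouvrir",  # the verb: open the window
    ("state", "Open"): "Ouvert",   # the state: the window is open
}

def i18n(text, meaning):
    """Return the translation for (meaning, text), falling back to the
    source text when no translation exists yet."""
    return TRANSLATIONS.get((meaning, text), text)

print(i18n("Open", "action"))   # Ouvrir
print(i18n("Open", "state"))    # Ouvert
print(i18n("Close", "action"))  # Close (no translation yet, source text returned)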
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well, it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is better for scalability, but it does require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
fr_FR
fr
Unlocalised. Source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided for in step (1) or (2), you will at least get a proper, understandable sentence rather than random programmer garbage like "error_file_not_found".
Plus, what do you do if it is a format string: “Sorry but the %s does not exist” ? Worse still: “Written %s entries to %s, total size: %d” ?
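To make that concrete, here is a tiny sketch using Python's standard gettext module (the "myapp" domain and "locale" directory are made-up names; PHP's gettext binding behaves the same way). Note that a real English sentence also survives as a usable format string when no catalogue matches:
import gettext

# Try fr_FR first, then fr; fall back to the untranslated source strings
# if neither catalogue is found.
t = gettext.translation("myapp", localedir="locale",
                        languages=["fr_FR", "fr"], fallback=True)
_ = t.gettext

# With no catalogue installed this still prints a readable English sentence,
# not an opaque key like "error_file_not_found".
print(_("Sorry but the %s does not exist") % "report.pdf")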
Quite an old question, but one additional reason I haven't seen in the answers yet:
With abstract keys you could end up with more placeholders than necessary, and thus more work for translators and possibly inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (that the developer is most likely not aware of and shouldn't need to be). E.g. it could be used not as an error but as a prompt like a placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable; the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely following a predefined wording scheme so as not to end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes (e.g. for contexts) can never be as precise as the text itself, I think (it's either too verbose or too general; beforehand you don't know, and afterwards it's difficult to change), and it's more work for the developer that isn't worth it IMHO.
Does anybody agree/disagree?

What's needed for NLP?

Assuming that I know nothing about anything and that I'm starting to program TODAY, what would you say is necessary for me to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods, but so far they're just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create something like the Remember The Milk API, which parses the user's input in order to provide a form for fast data entry based not on fields but on simple one-line phrases instead.
EDIT: RTM is a todo-list system. In order to enter a task you don't need to type into each field to fill in its value (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any technical constraints since it's going to be a personal project, but I'm more familiar with the .NET world. Actually, I'm not sure this is a matter of language, but if necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances, so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend", and it would fill in the location, the amount, tags and other fields.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
A framework like GATE may help, but even that may be a larger hammer than you really need.
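To make the tokenize → classify → pattern-match idea concrete, here is a rough sketch in Python using nothing but regular expressions (all of the patterns and field names are assumptions you would refine):
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
TIME_RE = re.compile(r"\b\d{1,2}(:\d{2})?\s?(AM|PM)\b", re.IGNORECASE)

def parse_task(text):
    """Very small pattern-based extractor for phrases like
    'Dentist appointment monday at 2PM in WhateverPlace'."""
    fields = {}

    # Classify simple keywords.
    for tok in text.split():
        if tok.lower() in DAYS:
            fields["day"] = tok
    m = TIME_RE.search(text)
    if m:
        fields["time"] = m.group(0)

    # Patterns: everything after "in" is the place, everything before the
    # first day/time keyword is the task name.
    place = re.search(r"\bin\s+(.+)$", text)
    if place:
        fields["place"] = place.group(1)
    fields["name"] = re.split(r"\b(?:monday|tuesday|wednesday|thursday|friday|"
                              r"saturday|sunday|at|in)\b", text, maxsplit=1,
                              flags=re.IGNORECASE)[0].strip()
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
# {'day': 'monday', 'time': '2PM', 'place': 'WhateverPlace', 'name': 'Dentist appointment'}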
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. As for detecting items, you can use a list of objects from an ontology such as OpenCyc, which can provide you with a list of objects such as beer, coffee, bread, milk, etc.; this will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
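Along the same lines, here is a small sketch for the financial phrase from the question; the currency/time patterns and the tiny item list merely stand in for a real ontology such as OpenCyc:
import re

AMOUNT_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s?(USD|\$)", re.IGNORECASE)
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b|\b\d{1,2}\s?(?:AM|PM)\b", re.IGNORECASE)
# A stand-in for an ontology lookup (OpenCyc would give a far larger list).
KNOWN_ITEMS = {"beer", "coffee", "bread", "milk"}

def parse_expense(text):
    fields = {}
    amount = AMOUNT_RE.search(text)
    if amount:
        fields["amount"] = float(amount.group(1))
    time = TIME_RE.search(text)
    if time:
        fields["time"] = time.group(0)
    fields["items"] = [w for w in re.findall(r"[A-Za-z]+", text)
                       if w.lower() in KNOWN_ITEMS]
    return fields

print(parse_expense("Spent 10USD on Coffee last night with my girlfriend"))
# {'amount': 10.0, 'items': ['Coffee']}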

i18n - best practices for internationalization - XLIFF, gettext, INI, ...? [closed]

EDIT: I would really like to see some general discussion about the formats, their pros and cons!
EDIT 2: The bounty didn't really help to create the needed discussion; there are a few interesting answers, but comprehensive coverage of the topic is still missing. Six people marked the question as a favourite, which shows me that there is interest in this discussion.
When deciding about internationalization, the toughest part IMO is the choice of storage format.
For example, the Zend PHP Framework offers the following adapters, which cover pretty much all my options:
Array : no, hard to maintain
CSV : don't know, possible problems with encoding
Gettext : frequently used, poEdit for all platforms available BUT complicated
INI : don't know, possible problems with encoding
TBX : no clue
TMX : too much of a big thing? no editors freely available.
QT : not very widespread, no free tools
XLIFF : the coming standard? BUT no free tools available.
XMLTM : no, not what I need
Basically I'm stuck with the four 'bold' choices. I would like to use INI files, but I'm reading about the encoding problems... is it really a problem if I use strict UTF-8 (files, connections, db, etc.)?
I'm on Windows and I tried to figure out how poEdit works, but just didn't manage. There are no tutorials on the web either; is gettext still a choice, or an endangered species anyway?
What about XLIFF, has anybody worked with it? Any tips on what tools to use?
Any ideas for Eclipse integration of any of these technologies?
POEdit isn't really hard to get the hang of. Just create a new .po file, then tell it to import strings from the source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.
You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.
In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.
There is always the Translate Toolkit, which allows translating between (I think) all the mentioned formats, and it prefers gettext (po) and XLIFF.
You can use INI if you want; it's just that INI doesn't have a way to tell anyone that it is in UTF-8, so if someone opens your INI file with an editor, it might corrupt your file.
So it comes down to whether you can trust the users to edit it with UTF-8 encoding.
You can add a BOM at the start of the file; some editors know about it.
What do you want it to store? User-generated content or your application resources?
I worked with two of these formats on the l10n side: TMX and XLIFF. They are pretty similar. TMX is more popular nowadays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it, Transolution, but it is no longer being developed.
I do the data storage myself using a custom design: all displayed text is stored in the DB.
I have two tables.
The first table has an identity value, a 32-character varchar field (indexed on this field),
and a 200-character English description of the phrase.
My second table has the identity value from the first table, a language code (EN_UK, EN_US, etc.) and an NVARCHAR column for the text.
I use an nvarchar for the text because it supports other character sets which I don't yet use.
The 32 character varchar in the first table stores something like 'pleaselogin' while the second table actually stores the full "Please enter your login and password below".
I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around the word ordering in different languages.
I have only had to deal with Arabic numerals so far, but will have to work something out for the first user who requires non-Arabic numerals.
I actually pull this information out of the database every two hours and cache it to disk in one XML file per language. Extensive use of CDATA is made.
There are many options available; for performance you could use HTML templates for each language. My method works well, but it does use the XML DOM a lot at runtime to create the pages.
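For illustration, the runtime replacement of those dynamic tokens could look roughly like this (a Python sketch; the token syntax is taken from the description above, the lookup values are invented):
import re

TOKEN_RE = re.compile(r"\{\[dynamic:(\w+)\]\}")

# Hypothetical runtime values; in the described design these would be
# computed per user/request.
DYNAMIC_VALUES = {"passworddaysremain": "7"}

def expand(text):
    """Replace {[dynamic:key]} tokens with their runtime values, leaving
    unknown tokens untouched so problems stay visible."""
    return TOKEN_RE.sub(
        lambda m: DYNAMIC_VALUES.get(m.group(1), m.group(0)), text)

print(expand("You have {[dynamic:passworddaysremain]} days to change your password."))
# You have 7 days to change your password.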
One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them. They're also reasonably friendly to other systems (and to text editors). You can just create separate string tables (and bitmap tables) for each language, and mark each such table with what language it is in.
None of those choices looks very appetizing to me.
If you're sending files out for translation into multiple languages, then you want to be able to trust that the encodings are correct, especially if no one on your team speaks those languages. Sometimes it's difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertently corrupt file encodings if you let your OS 'guess'.
You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.
A format like Java's XML resources would be ideal.
This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else than a different approach. I went with an object-oriented approach. What I did was create a system that encapsulates language files into a class by storing them in an array of string=>translation pairs. Access to a translation is through a method called translate, with the key string as a parameter. Extending classes inherit the parent's language array and can add to it or overwrite it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making it more maintainable than an array by itself. Plus, you only load the classes you need.
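A rough sketch of that idea (written in Python for illustration; the class and method names are mine, not the actual system described):
class BaseLanguage:
    """Encapsulates string => translation pairs; subclasses inherit the
    parent's entries and can add to or override them."""
    strings = {"greeting": "Hello", "farewell": "Goodbye"}

    def __init__(self):
        # Collect entries from the whole class hierarchy, parents first,
        # so a child class can overwrite individual keys.
        merged = {}
        for cls in reversed(type(self).__mro__):
            merged.update(getattr(cls, "strings", {}))
        self._strings = merged

    def translate(self, key):
        # Fall back to the key itself when no translation exists.
        return self._strings.get(key, key)

class FrenchLanguage(BaseLanguage):
    strings = {"greeting": "Bonjour"}  # overrides only what it needs

fr = FrenchLanguage()
print(fr.translate("greeting"))  # Bonjour
print(fr.translate("farewell"))  # Goodbye (inherited from BaseLanguage)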
We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.
In the application we use various tricks to create text IDs, like
£("btn_save")
£(Order.class,"amt")
The translations are loaded from the DB when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according to the language specified in the user session.
You can check out my l10n tool, called iL10Nz, at http://www.myl10n.net
You can upload po/pot files, XLIFF files, and INI files, then translate and download.
You can also check out this video on YouTube:
http://www.youtube.com/watch?v=LJLmxMFxaxA
Thanks
Olivier

Resources