Extracting strings for translation from VB6 code - ruby

I have a legacy VB application that still has some life in it, and I am wanting to translate it to another language.
I plan to write a Ruby script, possibly utilising a parser, to extract all strings from the three million lines of source, replace them with constants, and move them to a string resource file that can be used to provide translations.
Is anyone aware of a script/library that could be used to intelligently extract the strings?

I'm not aware of any existing off-the-shelf tool that you could use. We created a tool like this at my work and it worked well. The FRM file format is quite simple (although only briefly documented). We wrote a tool that (1) extracted all strings from control definitions and (2) generated the code to reload them at runtime during Form_Load.

Related

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in process of making site i18n-aware. Marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily-available tool that'd eat gettext .po/.pot file(s) and spew out such translation. Still, I think the idea is pretty obvious, so there must be something out there, already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses gettext-compatible library. The only thing the tool should be aware of, is that there could be HTML fragments in translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux cyrillic, you just have to write a program or script that reads Latin and outputs that, and feed that program to msgfilter instead of sed.
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
Haven't tried this myself yet, but found podebug tool from Translate Toolkit. Based on documentation (flipped and unicode rewrite options), this looks exactly the tool I wished for.

Eliminating code duplication in a single file

Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
Thanks in advance.
Edit:
Thanks for all the great tools! I'll definitely check them out.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.
Check out Atomiq. It finds code that is duplicate that is prime for extracting to one location.
http://www.getatomiq.com/
If you're using Eclipse, you can use the copy paste detector (CPD) https://olex.openlogic.com/packages/cpd.
You don't say what language you are using, which is going to affect what tools you can use.
For Python there is CloneDigger. It also supports Java but I have not tried that. It can find code duplication both with a single file and between files, and gives you the result as a diff-like report in HTML.
See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copyies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different seqeunces of statements.
The CloneDR handles many languages, including Java (1.4,1.5,1.6) and C# especially up to C#4.0. You can see sample clone detection reports at the website, also including one for C#.
Resharper does this automagically - it suggests when it thinks code should be extracted into a method, and will do the extraction for you
Check out PMD , once you have configured it (which is tad simple) you can run its copy paste detector to find duplicate code.
One with some Office skills can do following sequence in 1 minute:
use ordinary formatter to unify the code style, preferably without line wrapping
feed the code text into Microsoft Excel as a single column
search and replace all dual spaces with single one and do other replacements
sort column
At this point the keywords for duplicates will be already well detected. But to go further
add comparator formula to 2nd column and counter to 3rd
copy and paste values again, sort and see the most repetitive lines
There is an analysis tool, called Simian, which I haven't yet tried. Supposedly it can be run on any kind of text and point out duplicated items. It can be used via a command line interface.
Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd

Localization best practices

I'm starting to modify my app, which uses all hardcoded strings for errors, GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
-Put all string in ressource (.rc) files.
-define all strings in a file, once for each language. Use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally prefered?
Put all the strings in resource files. Once you've done that, there's several good translation packages available. One useful thing these packages do is allow you to get translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there's a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used book service; most of the more modern stuff seems to be on internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
There's another approach of just putting strings in the source with somethign like tr(" ") and usign one of the tools that strips them out and converts them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or db keys).
It makes the source easier to read and search, isntead of having to lookup what IDS_MESSAGE34 means.
One problem with resource files, at least with Windows/MFC, is that you can't use the stringtable in dialogs. So you have some text in the stringtabel and some in the dialog section which you have to dela with separately.

How do you manage the String Translation Process?

I am working on a Software Project that needs to be translated into 30 languages. This means that changing any string incurs into a relatively high cost. Additionally, translation does not happen overnight, because the translation package needs to be worked by different translators, so this might take a while.
Adding new features is cumbersome somehow. We can think up all the Strings that will be needed before we actually code the UI, but sometimes still we need to add new strings because of bug fixes or because of an oversight.
So the question is, how do you manage all this process? Any tips in how to ease the impact of translation in the software project? How to rule the strings, instead of having the strings rule you?
EDIT: We are using Java and all Strings are internationalized using Resource Bundles, so the problem is not the internationalization per-se, but the management of the strings.
I'm not sure the platform you're internationalizing in. I've written an answer before on the best way to il8n an application. See What do I need to know to globalize an asp.net application?
That said - managing the translations themselves is hard. The problem is that you'll be using the same piece of text across multiple pages. Your framework may not, however, support only having that piece of text in one file (resource files in asp.net, for instance, encourage you to have one resource file per language).
The way that we found to work with things was to have a central database repository of translations. We created a small .net application to import translations from resource files into that database and to export translations from that database to resource files. There is, thus, an additional step in the build process to build the resource files.
The other issue you're going to have is passing translations to your translation vendor and back. There are a couple ways for this - see if your translation vendor is willing to accept XML files and return properly formatted XML files. This is, really, one of the best ways, since it allows you to automate your import and export of translation files. Another alternative, if your vendor allows it, is to create a website to allow them to edit the translations.
In the end, your answer for translations will be the same for any other process that requires repetition and manual work. Automate, automate, automate. Automate every single thing that you can. Copy and paste is not your friend in this scenario.
Pootle is an webapp that allows to manage translation process over the web.
There are a number of major issues that need to be considered when internationalizing an application.
Not all strings are created equally. Depending upon the language, the length of a sentence can change significantly. In some languages, it can be half as long and in others it can be triple the length. Make sure to design your GUI widgets with enough space to handle strings that are larger than your English strings.
Translators are typically not programmers. Do not expect the translators to be able to read and maintain the correct file formats for resource files. You should setup a mechanism where you can transform the translated data round trip to your resource files from something like an spreadsheet. One possibility is to use XSL filters with Open Office, so that you can save to Resource files directly in a spreadsheet application. Also, translators or translation service companies may already have their own databases, so it is good to ask about what they use and write some tools to automate.
You will need to append data to strings - don't pretend that you will never have to or you will always be able to put the string at the end. Make sure that you have a string formatter setup for replacing placeholders in strings. Furthermore, make sure to document what are typical values that will be replaced for the translators. Remember, the order of the placeholders may change in different languages.
Name your i8n string variables something that reflects their meaning. Do you really want to be looking up numbers in a resource file to find out what is the contents of a given string. Developers depend on being able to read the string output in code for efficiency a lot more than they often realize.
Don't be afraid of code-generation. In my current project, I have written a small Java program that is called by ant that parses all of the keys of the default language (master) resource file and then maps the key to a constant defined in my localization class. See below. The lines in between the //---- comments is auto-generated. I run the generator every time I add a string.
public final class l7d {
...normal junk
/**
* Reference to the localized strings resource bundle.
*/
public static final ResourceBundle l7dBundle =
ResourceBundle.getBundle(BUNDLE_PATH);
//---- start l7d fields ----\
public static final String ERROR_AuthenticationException;
public static final String ERROR_cannot_find_algorithm;
public static final String ERROR_invalid_context;
...many more
//---- end l7d fields ----\
static {
//---- start setting l7d fields ----\
ERROR_AuthenticationException = l7dBundle.getString("ERROR_AuthenticationException");
ERROR_cannot_find_algorithm = l7dBundle.getString("ERROR_cannot_find_algorithm");
ERROR_invalid_context = l7dBundle.getString("ERROR_invalid_context");
...many more
//---- end setting l7d fields ----\
}
The approach above offers a few benefits.
Since your string key is now defined as a field, your IDE should support code completion for it. This will save you a lot of type. It get's really frustrating looking up every key name and fixing typos every time you want to print a string.
Someone please correct me if I am wrong. By loading all of the strings into memory at static instantiation (as in the example) will result in a quicker load time at the cost of additional memory usage. I have found the additional amount of memory used is negligible and worth the trade off.
The localised projects I've worked on had 'string freeze' dates. After this time, the only way strings were allowed to be changed was with permission from a very senior member of the project management team.
It isn't exactly a perfect solution, but it did enable us to put defects regarding strings on hold until the next release with a valid reason. Once the string freeze has occured you also have a valid reason to deny adding brand new features to the project on 'spur of the moment' decisions. And having the permission come from high up meant that middle managers would have no power to change specs on your :)
If available, use a database for this. Each string gets an id, and there is either a table for each language, or one table for all with the language in a column (depending on how the site is accessed the performance dictates which is better). This allows updates from translators without trying to manage code files and version control details. Further, it's almost trivial to run reports on what isn't translated, and keep track of what was an autotranslation (engine) vs a real human translation.
If no database, then I stick each language in a separate file so version control issues are reduced. But the structure is basically the same - each string has an id.
-Adam
Not only did we use a database instead of the vaunted resource files (I have never understood why people use something like that which is a pain to manage, when we have such good tools for dealing with databases), but we also avoided the need to tag things in the application (forgetting to tag controls with numbers in VB6 Forms was always a problem) by using reflection to identify the controls for translation. Then we use an XML file which translates the controls to the phrase IDs from the dictionary database.
Although the mapping file had to be managed, it could still be managed independent of the build process, and the translation of the application was actually possible by end-users who had rights in the database.
The solution we came up to so far is having a small application in Excel that reads all the property files, and then shows a matrix with all the translations (languages as headers, keys as rows). It is quite evident what is missing then. This is send to the translators. When it comes back, then the sheet can be processed to generate the same property bundles back again. So far it has eased the pain somewhat, but I wonder what else is around.
This google book - resource file management gives some good tips
You can use Resource File Management software to keep track of strings that have changed and control the workflow to get them translated - otherwise you end up in a mess of freezes and overbearing version control
Some tools that do this sort of thing - no connection an I haven't actually used them, just researching
http://www.sisulizer.com/
http://www.translationzone.com/en/products/
I put in a makefile target that finds all the .properties files and puts them in a zip file to send off to the translators. I offered to send them just diffs, but for some reason they want the whole bundle of files each time. I think they have their own system for tracking just differences, because they charge us based on how many strings have changed from one time to the next. When I get their delivery back, I manually diff all their files with the previous delivery to see if anything unexpected has changed - one time all the PT_BR (Brazillian Portuguese) strings changed, and it turns out they'd used a PT_PT (Portuguese Portuguese) translator for that batch in spite of the order for PT_BR.
In Java, internationalization is accomplished by moving the strings to resource bundles ... the translation process is still long and arduous, but at least it's separated from the process of producing the software, releasing service packs etc. One thing that helps is to have a CI system that repackages everything any time changes are made. We can have a new version tested and out in a matter of minutes whether it's a code change, new language pack or both.
For starters, I'd use default strings in case a translation is missing. For example, the English or Spanish value.
Secondly, you might want to consider a web app or something similar for your translators to use. This requires some resources upfront, but at least you won't need to send files around and it will be obvious for the translators which strings are new, etc.

What are the best practices for building multi-lingual applications on win32?

I have to build a GUI application on Windows Mobile, and would like it to be able user to choose the language she wants, or application to choose the language automatically. I consider using multiple dlls containing just required resources.
1) What is the preferred (default?) way to get the application choose the proper resource language automatically, without user intervention? Any samples?
2) What are my options to allow user / application control what language should it display?
3) If possible, how do I create a dll that would contain multiple language resources and then dynamically choose the language?
For #1, you can use the GetSystemDefaultLangID function to get the language identifier for the machine.
For #2, you could list languages you support and when the user selects one, write the selection into a text file or registry (is there a registry on Windows Mobile?). On startup, use the function in #1 only if there is no selection in the file or registry.
For #3, the way we do it is to have one resource DLL per language, each of which contains the same resource IDs. Once you figure out the language, load the DLL for that language and the rest just works.
Re 1: The previous GetSystemDefuaultLangID suggestion is a good one.
Re 2: You can ask as a first step in your installation. Or you can package different installers for each language.
Re 3:
In theory the DLL method mentioned above sounds great, however in practice it didn't work very well at all for me personally.
A better method is to surround all of the strings in your program with either: Localize or NoLocalize.
MessageBox(Localize("Hello"), Localize("Title"), MB_OK);
RegOpenKey(NoLocalize("\\SOFTWARE\\RegKey"), ...);
Localize is just a function that converts your english text to a the selected language. NoLocalize does nothing.
You want to surround your strings with these values though because you can build a couple of useful scripts in your scripting language of choice.
1) A script that searches for all the Localize(" prefixes and outputs a .ini file with english=otherlangauge name value pairs. If the output .ini file already contains a mapping you don't add it again. You never re-create the ini file completely, your script just adds the missing ones each time you run your script.
2) A script that searches all the strings and makes sure they are surrounded by either Localize(" or NoLocalize(". If not it tells you which strings you still need to localize.
The reason #2 is important is because you need to make sure all of your strings are actually consciously marked as needing localization or not. Otherwise it is absolutely impossible to make sure you have proper localization.
The reason for #1 instead of loading from a DLL is because it takes no work to maintain this solution and you can add new strings that need to be translated on the fly.
You ship the ini files that are output with your program. You also give these ini files to your translators so they can convert the english=otherlanguage pairs. When they send it back to you, you simply replace your checked in .ini file with the one given by your translator. Running your script as mentioned in #1 will re-add any missing translations if any were done while the translator was translating.

Resources