Can I automatically update msgids in gettext's .po files for trivial text changes?

With gettext, the original (usually English) text of messages serves as
the message key ("msgid") for the translations. This means that every time the
original text changes, the msgid must be updated in all the .po files.
For real changes of the text, this is obviously unavoidable, as the
translator must update the translation.
However, if the change of the original does not change its meaning,
re-translation is superfluous (e.g. a change in punctuation, whitespace
changes, or the correction of a spelling mistake).
Is there a way to update the .po files automatically in that case?
I tried to use xgettext & msgmerge (with fuzzy matching turned on), but
fuzzy matching sometimes fails, plus this produces lots of ugly
"#,fuzzy" flags.
Note: There is a similar question:
How to efficiently work with gettext PO files when making small edits to large text values
However, it's about large strings, thus about a more specific problem.

One way to avoid the problem is to leave the msgids alone, have a .po file for the original language, and make the fix inside that.
It always strikes me as more of a workaround than a proper fix, though. In the next iteration (where there will definitely be more msgid changes), the msgid is changed in the source as well, and then either the translators pick it up in their usual update or each language file is updated by hand.

I've had exactly this issue when making minor changes to a Django project. What I do is the following:
1. Change the message in the code.
2. Run find and replace on all translation files ("django.po"), replacing the old message (msgid) with the new one.
3. Run django-admin makemessages.
If I have done things right, the last step is superfluous (i.e., the change has already been made for gettext). Django uses the gettext utilities, so it shouldn't matter how you make your message files.
I find and replace like so:
find . -name "*.po" -print | xargs sed -i 's/oldmessageid/newmessageid/g'
Courtesy of http://rushi.vishavadia.com/blog/find-replace-across-multiple-files-in-linux
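Note that the one-liner above replaces the text wherever it occurs, including inside msgstr translations. A minimal sketch of a narrower variant follows; the helper name rename_msgid is hypothetical, and it assumes the old and new messages are single-line and free of characters that are special to sed or to PO quoting:

```shell
#!/bin/sh
# rename_msgid OLD NEW: hypothetical helper, not part of gettext.
# Reads a .po file on stdin and writes it to stdout, rewriting only a
# line that is exactly: msgid "OLD". msgstr lines are left untouched.
# Assumption: OLD and NEW are single-line and contain no characters
# that are special to sed or to PO quoting (/ & \ . * [ ] ^ $ ").
rename_msgid() {
    old=$1
    new=$2
    sed "s/^msgid \"$old\"\$/msgid \"$new\"/"
}

# Usage sketch (writes through a temporary file instead of sed -i):
#   find . -name '*.po' | while IFS= read -r po; do
#       rename_msgid 'Helo world' 'Hello world' <"$po" >"$po.tmp" &&
#           mv "$po.tmp" "$po"
#   done
```

Multi-line msgids (those that continue on following quoted lines) are not handled by this sketch and would still need a manual edit.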

Related

How do I not commit the development team lines in project.pbxproj without deselecting those lines manually?

I am collaborating with my friend on an iOS app. We use different Apple IDs in our Xcodes, so in "Signing and Capabilities" tab of project settings, we select different teams in the "Team" field.
From my observation, changing this affects the MyProject.xcodeproj/project.pbxproj file, which stores the file references that the Xcode project has, in addition to the "Team". Here's a snippet of what is changed:
buildSettings = {
    ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
    ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
    CODE_SIGN_STYLE = Automatic;
    DEVELOPMENT_TEAM = <my team ID>; /* this is changed */
    INFOPLIST_FILE = MyProject/Info.plist;
    LD_RUNPATH_SEARCH_PATHS = (
        "$(inherited)",
        "@executable_path/Frameworks",
    );
    PRODUCT_BUNDLE_IDENTIFIER = io.github.sweeper777.MyApp;
    PRODUCT_NAME = "$(TARGET_NAME)";
    SWIFT_VERSION = 5.0;
    TARGETED_DEVICE_FAMILY = 1;
};
The problem arises when one of us commits this file and the other person pulls. The "puller" will now have the "Team" set to something invalid. When this person then tries to run the app on a real device, there will be code signing errors for obvious reasons. To solve this, this person must tediously go through all the targets that we have, and set each "Team" to their own team.
How can we make it so that on each person's computer, the "Team" stays the same after pulling, but any other changes to MyProject.xcodeproj/project.pbxproj are applied?
Remarks:
Putting the entire MyProject.xcodeproj/project.pbxproj in .gitignore doesn't work, because that would ignore every other change to it. Adding a new file to the project, for example, also changes MyProject.xcodeproj/project.pbxproj, and we want to be able to pull that change.
Manually deselecting the lines that say "DEVELOPMENT_TEAM = ..." when committing is as tedious as reselecting the correct team every time, so that's not a solution.
I found this. Apparently, I can configure git to run sed before git checkout and git add. However, that answer seems to ignore the line by deleting it completely. This means that my friend, when he pulls, would still have to reselect the correct team. What I want is the kind of "ignore" that simply stops tracking that line. That is, if there is a local version of that line, use that.
I am also aware that none of this would be a problem if we were on the same team. But if I understand this correctly, I can't have multiple people on my team unless I have a Company account, and not only can I not afford that, I don't own a company.
I don't use Xcode itself and do not know how to smuggle Git hooks and scripts past the Xcode interface, so you'll need more than just this answer. But you mention sed in comments, and given your proposed file format, that may well be the way to go:
buildSettings = {
    ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
    ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
    CODE_SIGN_STYLE = Automatic;
    DEVELOPMENT_TEAM = <my team ID>; /* this is changed */
    INFOPLIST_FILE = MyProject/Info.plist;
    LD_RUNPATH_SEARCH_PATHS = (
        "$(inherited)",
        "@executable_path/Frameworks",
    );
    PRODUCT_BUNDLE_IDENTIFIER = io.github.sweeper777.MyApp;
    PRODUCT_NAME = "$(TARGET_NAME)";
    SWIFT_VERSION = 5.0;
    TARGETED_DEVICE_FAMILY = 1;
};
Git has the ability to run what it calls clean and smudge filters. These can be used to run any arbitrary program you like, including sed, the "stream editor", which is particularly good at making single-line changes based on regular expression matches.
There is another method that may also work, and may "play better" with Xcode, or may play worse. I'll go over that too, after covering clean and smudge filters.
Before we dive into writing clean and smudge filters, and using them from Git—you'll need to know all of these details as you will have to write your own custom filters—we should start with a simple fact about Git commits: No part of any commit can ever be changed. Once you make a commit, the stuff that's inside the commit—the stored data in all of its files—is the way it is, forever. So these filters have to work within that system. Remember that, as it will help with understanding what we're doing.
How Git makes and stores objects
The files inside a commit are not files, exactly: they're not the same thing as files in your file system, at least. Instead, they are what Git calls objects, specifically blob objects. A blob object holds the file's data; other objects hold the file's name; and commit objects collect everything together to be used all at once. There's one more internal object type for annotated tags but we'll stop here as we're really only interested in the blob-object part.
When Git extracts a commit, it reads the internal blob objects and runs them through internal code to decompress and format them into regular files. This can include doing end-of-line hacking (turning LFs into CRLFs) if desired. Normally, all this happens entirely inside Git, and the end result is that Git writes out an ordinary everyday file for you to use. This ordinary file is what you will work on / with, in Xcode or any other editor and compiler system and so on. These ordinary files are in your working tree.
After you've extracted some commit, you'll do some work on it, by changing some or all of the files in your working tree, to achieve whatever result you wanted. This can include changing the buildSettings, editing Swift code, editing Objective-C code, and so on. You might add all-new files to the working tree, some of which you never commit at all (you can help make sure they are never committed by listing such files in .gitignore).
Eventually, though, you'd like to commit the updated code. To do so, you must run git add, or maybe have your IDE run git add for you (perhaps Xcode has clicky buttons to do this). This invokes code in Git that converts the working tree file(s) back to internal blob objects if and as needed.
Again, normally this is all handled entirely inside of Git. Git will read the working tree file, maybe do CRLF-to-LF-only changes, compress the text, search for duplicate objects, and do all the other complicated things necessary to prepare the file, so that it is ready to be committed. The resulting data need not match what's in your working tree at all: it just has to be something that, when Git later goes to extract the file, produces what you will need in your working tree.
Clean and smudge filters
This is where these clean and smudge filters come in. I said, above, that normally, Git does the extraction and insertion all on its own. For binary files, the only thing Git does here is apply lossless compression [1]. For text files, Git can do CRLF/LF substitutions as well. But what if you'd like to do your own operations?
You can: Git will let you do whatever you want during the extract process with a smudge filter, and will let you do whatever you want during the compress process with a clean filter. The clean filter replaces the in-file data, using a stream-edit type process [2], and then Git does its CRLF hacking if any and compressing on the "cleaned" data. The smudge filter replaces the decompressed, post-CRLF-hacking data coming out of Git with the data that should go into the working tree.
Hence you can write, as your clean filter, a sed script of the form:
s/DEVELOPMENT_TEAM = .*;/DEVELOPMENT_TEAM = DEVTEAMTEMPLATE;/
With that as the entire sed script, what sed will do is edit the incoming data stream and replace any actual development team text with the word DEVTEAMTEMPLATE.
Your smudge filter has to work slightly harder: it must find the template line and adjust it so that it contains the correct team ID. Where will you get the correct team ID? That's up to you: perhaps you can store it in a file in your home directory, or in a file that you create in the working tree but never commit in Git. You'll have to write this one or two or however-many-liner sed and/or shell script yourself.
[1] There are multiple phases of compression; git add does just one, and git checkout undoes all—including reading from "pack" files—as needed. The deeper level of compression, using delta encoding techniques, is entirely invisible at the "object" level, so nobody ever really has to think about it.
[2] With the advent of Git-LFS, Git gained the ability to run long-lived filters. Before that, Git always used simple stream filtering. The stream filtering is easier to understand, but is less efficient for doing en-masse operations on many files. Here, we're only interested in one file per repository anyway, so there's no need to go into the fancier long-lived filter details.
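The clean script given above and a matching smudge script can be sketched as a pair of shell functions. This is only a sketch: the function names are made up here, and it assumes the team ID consists of letters and digits only and is passed in as an argument (where you actually store it is up to you).

```shell
#!/bin/sh
# Sketch of the two filters as shell functions. Both read the file
# content on standard input and write the result to standard output;
# neither opens the file by name.

# Clean: replace whatever team ID is present with the placeholder.
clean_team() {
    sed 's/DEVELOPMENT_TEAM = .*;/DEVELOPMENT_TEAM = DEVTEAMTEMPLATE;/'
}

# Smudge: put a real team ID back in place of the placeholder.
# Assumption: the ID is passed as $1 and contains only letters and digits.
smudge_team() {
    sed "s/DEVELOPMENT_TEAM = DEVTEAMTEMPLATE;/DEVELOPMENT_TEAM = $1;/"
}

# Usage sketch, with a hypothetical file holding the local team ID:
#   smudge_team "$(cat "$HOME/.xcode_team_id")"
```

Lines that do not mention DEVELOPMENT_TEAM pass through both functions unchanged, so every other edit to the file is committed and checked out as usual.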
Defining clean and smudge filters
The tricky part here, with Git, is that you must define the filters in one place—in $HOME/.gitconfig or .git/config, for instance—and then tell Git to invoke them from another place, using the .gitattributes file. This is described in the gitattributes documentation. This documentation is pretty thorough, so read it. You can ignore all the long-running filter discussion, as noted above. I will quote one bit from the documentation here for emphasis, though, and expound on it:
Note that "%f" is the name of the path that is being worked on. Depending on the version that is being filtered, the corresponding file on disk may not exist, or may have different contents. So, smudge and clean commands should not try to access the file on disk, but only act as filters on the content provided to them on standard input.
When Git is running the smudge filter:
- it has opened some internal object (which may or may not be packed);
- it has decompressed it, or is in the process of decompressing it, and has pumped / is pumping out the data; and
- this data is being fed to your filter, but is not written out to any file anywhere.
Your filter can use %f to know the name of the target output file, but the data are not in that file yet. The data bytes are only in some OS-level pipes or sockets or whatever your OS uses for connecting the output of one program (Git's internal decompressors) to another (your filter). Your smudge filter must read its standard input to get the data, and write the smudged data to standard output so that Git can read it (if necessary) and/or redirect that output to the correct file. Do not attempt to open the file by name!
(The same holds for the clean filter, except that in many cases, the input to your filter is just the raw data already in the file, so that opening the file and reading it mostly works. So this can mislead you, if you do your tests using a clean filter.)
Note that you can implement this scheme without a clean filter at all: your smudge filter can replace whatever is in the committed file even if it's a real team ID, rather than just a template. If you choose to do this, however, you'll "see" the team ID changing every time someone with a different team ID commits the file. The nice thing about using the clean filter is that once the committed copies of the file use the template line, every future cleaned file also uses the template line, so that it never changes.
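The two-place setup described in this section (define the filter in a config file, attach it in .gitattributes) might look like the following. It is a sketch under assumptions: the filter name "devteam", the helper name, and the two script paths are all hypothetical placeholders for whatever you write yourself.

```shell
#!/bin/sh
# install_devteam_filter REPO: hypothetical helper.
# Defines a filter named "devteam" (the name is an assumption) in
# REPO/.git/config and attaches it to *.pbxproj via .gitattributes.
# The two script paths are placeholders for filters you write yourself.
install_devteam_filter() {
    repo=$1
    git -C "$repo" config filter.devteam.clean 'sh "$HOME/bin/clean-team.sh"'
    git -C "$repo" config filter.devteam.smudge 'sh "$HOME/bin/smudge-team.sh"'
    printf '%s\n' '*.pbxproj filter=devteam' >>"$repo/.gitattributes"
}

# Usage sketch, run once per clone:
#   install_devteam_filter .
```

The .gitattributes file can be committed like any other file, but .git/config is not part of any commit, so each collaborator has to define the filter in their own clone.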
Alternative: a template file
In general, it's unwise to commit actual configurations. Clean and smudge tricks can work, but they can only go so far: this particular file format works well because the change you want made is on a single line, and Git itself shows you file changes on a line-by-line basis, and sed works well with line-oriented input, and so on.
A lot of configuration files, though, wind up storing at least slightly-sensitive data, or perhaps very-sensitive data such as cleartext passwords. Such files should not be stored in Git at all if at all possible. Instead, you would store a template file in Git.
In this case, for instance, instead of storing MyProject.xcodeproj/project.pbxproj, you might have Git store MyProject.xcodeproj/project.pbxproj.template. This file would have template-ized contents. When you clone and check out the repository, you'd subsequently copy the template file into place and do any required adjustments.
Should the MyProject.xcodeproj/project.pbxproj file itself need to change, e.g., to acquire a new SWIFT_VERSION setting, you'd instead edit the template file, add that to Git, and commit. You would then use the usual "convert template to mine" process, or manually update the MyProject.xcodeproj/project.pbxproj file. Since this file is never committed—and is listed in .gitignore—it never goes into any commit and you never have to worry about collisions within it. Only the template file goes into Git.

Poedit without location comments

I'd like to have a .po file without path comments (location). Because of them, our team has a lot of trouble during git merges.
I can get the file without comments using the command:
$ xgettext --no-location -o input.po output.po
BUT! It removes the headlines at the beginning of the file as well. Without those headlines Poedit does not work correctly.
How can I remove all path comments but keep the headlines in the *.po file?
You answered your own question: use --no-location.
--no-location does not remove the gettext header (which you probably mean by "headlines"). It would make no sense for it to do it and Poedit itself uses it for its own PO(T) file(s).
Your problem is in your command: read the xgettext manpage. It is for extraction from source code, not for manipulating existing PO files (that's what msgcat is for).
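As a sketch of what the msgcat route could look like, assuming GNU gettext is installed (the wrapper name and file names are made up for illustration):

```shell
#!/bin/sh
# strip_locations IN OUT: hypothetical wrapper around msgcat.
# Writes a copy of the PO file IN to OUT without "#: file:line"
# comments; the header entry (the empty msgid at the top) is kept.
strip_locations() {
    msgcat --no-location -o "$2" "$1"
}

# Usage sketch:
#   strip_locations messages.po messages.clean.po
```

Writing to a separate output file keeps the original intact until the result has been checked in Poedit.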

How to find foreign language used in "C comments"

I have a large code base where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread in various places.
Is there a simple trick that will let me find them? I imagine first a way to extract all comments from the code and generate a single text file (with possible source file / line number info), then pipe this through some language detection app.
If that matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spellchecker. This way, the text that isn't recognized will get underlined and be easy to spot.
Other than that, I don't see an easy way to go through with this.
You could make a program, that reads the files and only prints the comments out to another output file, where you then spell check that file, but this would seem to be a waste of time, as you would easily be able to spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for:
If comment starts with /*, make sure it stops reading when encountering */
If comment starts with //, only read one line - unless:
If line starting with // ends with \, read next line as well
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

Ruby library for manipulating XML with minimal diffs?

I have an XML file (actually a Visual C# project file) that I want to manipulate using a Ruby script. I want to read the XML into memory, do some work on it that includes changing some attributes and some text (fixing up some path references), and then write the XML file back out. This isn't so hard.
The hard part is, I want the file I write to look the same as the file I read in, except where I made changes. If the input file used double quotes, I want the output to use double quotes. If the input had a space before />, I want the output to do the same. Basically, I want the output to be the same as the input, except where I explicitly made changes (which, in my case, will only be to attribute values, or to the text content of an element).
I want minimal diffs because this project file is checked into version control -- and because the next time I make a change in Visual Studio, it's going to rewrite it in its preferred format anyway. I want to avoid checking in a bunch of meaningless diffs that will then be changed back again in the near future. I also want to avoid having to open the project in Visual Studio, make a change, and save, before I can commit my Ruby script's changes. I want my Ruby script to just make its changes, nothing more.
I originally just parsed the file with regexes, but ran into cases where I really needed an XML library because I needed to know more about child elements. So I switched to REXML. But it makes the following undesirable changes to my formatting:
It changes all the attributes from double quotes to single quotes.
It escapes all the apostrophes inside attribute values (changing them to &apos;).
It removes the space before />.
It sorts each element's attributes alphabetically, rather than preserving the original order.
I'm working around this by doing a bunch of gsub calls on REXML's output, but is there a Ruby XML-manipulation library that's a better fit for "minimal diff" scenarios?
You can build your own SAX parser (using Nokogiri, for example; it's very easy and I recommend using it) to parse your XML file, change some data in it, and write out the processed XML file with your own customized, built-from-scratch XML generator. The bad news is that you have to build a tiny XML library and generator routine in this case, so it is not a trivial task.
Another way: don't build the SAX parser, but write an XML generator. Parse XML with your favourite library, change what you need to change and generate anything you want. You just need to recursively walk through all nodes in your document and output them within your conventions.

How to efficiently work with gettext PO files when making small edits to large text values

Looking for tips and/or tools on how to efficiently work with gettext PO files when making small edits to large msgid values.
Example: We have lots of multi-sentence/multi-paragraph messages that are stored in our PO message catalog files. If we make a very minor change to a message, perhaps editing a single sentence or even correcting punctuation, we lose our original translation when we run the msgmerge utility.
Rather than re-translate long messages (that have already gone through an editorial approval process) from scratch, our translators return to backup copies of their PO files and manually search for the text of the last msgid/msgstr translation pair which they then diff against the current msgid values to see what has changed, followed by a copy and paste of the last translation which they then edit to reflect the updated msgid value.
That's a lot of work! Certainly there must be a better way of handling this type of workflow?
Is there a best practice way to archive and find previous translations that are no longer in a PO file? One idea that comes to mind is to store a unique msg id in the text of our messages or in the comments that precede our message and use this id to retrieve previous msgid/msgstr translation pairs for review. Or are there PO editors or online services that make this process more efficient?
Thank you,
Malcolm
I've been looking for a way to make minor changes to msgids without disturbing existing translations - for instance, typo fixes in the source text. Here's a recipe I've just worked out that doesn't involve websites:
Use msgen from GNU gettext to generate an English-to-English po file:
msgen project.pot >corrections.po
Manually edit the msgstrs in "corrections.po" to reflect the typo fixes made in the source text, so we have a mapping from uncorrected to corrected strings. (I haven't thought about how to automate this bit.)
For each "real" translation (for example ca.po): abuse poswap from the Translate Toolkit (translate-toolkit in Ubuntu) to change the msgids:
poswap -i corrections.po -t ca.po -o ca.new.po
This does seem to lose header comments and obsolete strings from GNU gettext po files, but manually fixing those up is much less work than manually tweaking msgids in each translation (and could probably easily be scripted).
(Obviously, this should only be used in exceptional circumstances, where you're absolutely sure that none of the translators need the opportunity to re-review their translations.)
Virtaal's translation memory support can probably help with this. If your original units are in the translation memory, they will be shown (with differences) within a certain margin of change (based on Levenshtein distance). The suggestion will still contain the original (unmodified) translation, but at least the original text is more easily accessible and the differences are highlighted.
I'm not 100% sure, but Pootle might also offer a web based solution. If you need any help, ask in #pootle on FreeNode.
The more general improvement is, of course, to separate/segment the units as far as possible.
