Searching the Mac OS X system dictionaries? - cocoa

I'd like to search for words in the OS X system dictionary (or dictionaries) using a simple glob or regex rather than looking up a known term. (Currently I'm using /usr/share/dict/words instead, but the OS X dictionaries would be a lot nicer.)
The Dictionary Services interface is quite limited and doesn't allow this, but it seems like DCSGetTermRangeInString might be doing something similar under the hood. Does anyone know of a way to access such functionality?
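To illustrate what I mean by limited: the supported lookup is strictly term-based. Here's a minimal sketch of an exact-term lookup from Python via the PyObjC bridge (assuming PyObjC is installed); there is no glob or regex variant of this call:

    # A minimal sketch of the supported, exact-term lookup through the
    # PyObjC DictionaryServices bridge (assumes PyObjC is installed).
    import DictionaryServices

    def define(word):
        # None selects the default dictionary; the range tuple covers
        # the whole string. Only a known term can be looked up.
        return DictionaryServices.DCSCopyTextDefinition(None, word, (0, len(word)))

    print(define("dictionary"))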
Alternatively, is there a way to extract a word list from the dictionary? I could then grep that. Some dictionaries seem to include the source XML in the bundle, which should be easy enough to parse, but (not surprisingly, I guess) the big language dictionaries only have the data in some binary format. Any clues as to what that might be?

The dictionary Apple provides in OS X is licensed from one of the major publishers. Legally, they can't let you dump the whole word list.

Related

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily available tool that would eat gettext .po/.pot file(s) and spew out such a translation. Still, I think the idea is pretty obvious, so there must be something out there already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses a gettext-compatible library. The only thing the tool should be aware of is that there could be HTML fragments in the translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want faux Cyrillic, you just have to write a program or script that reads Latin text and outputs that, and feed that program to msgfilter instead of sed.
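For example, a minimal filter sketch in Python (the mapping below is illustrative, not exhaustive, and a real version should also leave HTML tags untouched):

    #!/usr/bin/env python3
    # faux_cyrillic.py -- an illustrative msgfilter-compatible filter.
    # msgfilter passes each translation on stdin and reads the result
    # from stdout.
    import sys

    # Latin -> similar-looking Cyrillic letters (deliberately partial)
    FAUX = str.maketrans({
        "A": "Д", "B": "Б", "E": "Э", "K": "К", "N": "И",
        "O": "Ф", "R": "Я", "U": "Ц", "W": "Ш", "Y": "У",
        "n": "и", "r": "я", "u": "ц", "w": "ш", "y": "у",
    })

    sys.stdout.write(sys.stdin.read().translate(FAUX))

You would then feed it to msgfilter in place of sed: msgfilter -i django.po python3 faux_cyrillic.py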
If your distribution has a talkfilters package, it provides a few programs that might be useful in this specific case. All of them should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
I haven't tried this myself yet, but I found the podebug tool from the Translate Toolkit. Based on the documentation (the flipped and unicode rewrite options), this looks like exactly the tool I wished for.

How do I reverse engineer Mac OS X language localisation files for natural language learning?

OK, the goal of this question is not strictly programming related but it is a question programmers can answer using programming tools, and programmers may find useful answers here. Bear with me.
I find changing the system language in Mac OS X a useful way to augment my learning of natural languages, e.g. French. However, sometimes I find a menu item or dialog box in French that I can't understand, and it's a bore to google the translation or change the system language back to English. But I know that the English translation is hidden away somewhere in the localisation files and maps somehow to the French phrase. So what I want to do is extract all the text from all the localisation files to develop a mapping of this phrase in English = that phrase in French, so I can look it up easily.
I know that the localisations are stored in things like Localizable.strings files inside .lproj folders, and in nib files, but I can't make head or tail of how they are stored or how to work with them. I can program, but I've never written anything in Xcode. All the information I can find is for Mac OS / iOS programmers localising their software, not for hackers extracting already-made localisation information.
How can I extract the foreign language information as plain text from Mac OS X system and 3rd party software localisation files? Thanks!
Strings files are easy. They're simply dictionaries serialized as property lists. The dictionary keys are used by the program to look up the given string for a particular localization. You can build a mapping from English to another language by loading both dictionaries, iterating over the keys, and using the value from the English dictionary as the key in your output and the value from the other language dictionary as the value in your output.
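A quick sketch of that mapping step in Python (paths are illustrative; plutil normalizes old-style or binary strings files to XML so that plistlib can parse them):

    # Build an English -> French phrase mapping from two strings files.
    import plistlib
    import subprocess

    def load_strings(path):
        # Let plutil convert whatever on-disk format to XML plist
        xml = subprocess.run(
            ["plutil", "-convert", "xml1", "-o", "-", path],
            check=True, capture_output=True,
        ).stdout
        return plistlib.loads(xml)

    en = load_strings("en.lproj/Localizable.strings")
    fr = load_strings("fr.lproj/Localizable.strings")

    # Keys shared by both localizations map English text to French text
    mapping = {en[key]: fr[key] for key in en.keys() & fr.keys()}
    print(len(mapping), "phrases mapped")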
NIBs are harder. The build process "compiles" NIB files into a form that's not conducive to editing or parsing. If you have access to uncompiled NIB files, you can use ibtool --export-strings-file to dump a strings file, which you could then process as above. If you don't, I think you may have a hard time.

Are dynamic UTIs stable?

I have files of a format that has no declared UTI, so Launch Services has assigned it a dynamic UTI (dyn.ah62d4rv4ge81g23wsmw1a5dbte). I have no control over the UTI of these documents.
It also happens that I would like to develop a Quick Look generator for that format, and that Quick Look generators only rely on the document UTI, and will ignore any other kind of document identification present in their property list (such as the creator code and the extension).
Is it safe for me to use the dynamic UTI until the developer adds one? Are those generated by a stable algorithm that has good chances of returning the same UTI for the same files on another machine?
Yes, dynamic UTIs are stable and even include information about the file content. Actually, the random-looking code after 'dyn.' is a base-32 encoding of the known type information.
This article by Alastair Houghton explains that in detail. (Unfortunately this was written several months after you posted your question :-) But it might help others.)
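If you're curious, here is a rough sketch of that decoding in Python, based on my reading of the article: the custom alphabet and the MSB-first bit packing are taken from its description, so treat this as an illustration rather than a documented API.

    # Decode the portion after "dyn.a" using the custom base-32
    # alphabet described in the article (an assumption, not public API).
    ALPHABET = "abcdefghkmnpqrstuvwxyz0123456789"

    def decode_dyn_uti(uti):
        encoded = uti[len("dyn.a"):]
        bits, nbits, out = 0, 0, bytearray()
        for ch in encoded:
            bits = (bits << 5) | ALPHABET.index(ch)  # 5 bits per character
            nbits += 5
            if nbits >= 8:
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)   # emit complete bytes
        return out.decode("ascii", errors="replace")

    print(decode_dyn_uti("dyn.ah62d4rv4ge81g23wsmw1a5dbte"))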
Dynamic UTIs are apparently generated in a deterministic way that makes them viable identifiers across different Macs.
So, it's safe to use a dynamic UTI for plugin bundles.

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and so on through the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojūon jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally off base, and there actually is some way to do this sorting, without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。 (Thank you in advance.)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
There are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
And here somebody has tested Ruby with MeCab, using the tagger's -Oyomi option:
http://hirai2.blog129.fc2.com/blog-entry-4.html
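If you'd rather prototype without the Ruby bindings, the same -Oyomi idea can be sketched in Python with the mecab-python3 package (this assumes MeCab and an IPA dictionary are installed):

    # A minimal sketch using the mecab-python3 bindings; -Oyomi asks
    # for readings, output as full-width katakana.
    import MeCab

    tagger = MeCab.Tagger("-Oyomi")
    print(tagger.parse("東京都立病院"))  # expected: トウキョウトリツビョウイン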
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab; it appears to have done the trick.
We have a mostly-Rails backend, but in our case we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc.) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find the right call: NKF.nkf("-h1 -w", katakana_str) # -h1 means convert to hiragana, -w means output UTF-8. (A non-Ruby sketch of this step follows this list.)
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combine the input file with the MeCab-converted file, so that the resulting file has extra columns inserted, à la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our Rails app with its built-in values.
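For anyone not on Ruby, the hiragana-normalization step from above is a simple codepoint shift, since the two kana blocks are parallel in Unicode; a sketch in Python:

    # Katakana -> hiragana without NKF: the kana blocks are parallel
    # in Unicode, offset by 0x60, so a codepoint shift does the job.
    def to_hiragana(text):
        return "".join(
            chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
            for ch in text
        )

    print(to_hiragana("トウキョウト"))  # -> とうきょうと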
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
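Once you do have readings, step three is trivial, since sorting plain hiragana by codepoint approximates gojūon order; a sketch (the readings below are illustrative guesses, which is exactly the hard part):

    # Sort by the (guessed) reading, not by the kanji themselves.
    # 高谷 could also be たかたに, こうたに, etc. -- hence the problem.
    readings = {"高橋": "たかはし", "高谷": "たかや"}

    print(sorted(readings, key=readings.get))  # -> ['高橋', '高谷']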
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using it is a good idea.
Let me introduce another method, though.
If your app is written in Microsoft VBA, you can call the GetPhonetic function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to seeing prefectures sorted by 「都道府県コード」 (prefecture codes).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in JIS X 0401 and ISO 3166-2:JP.
See (Japanese Wikipedia):
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
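A sketch of what that looks like in practice, using a subset of the JIS X 0401 codes quoted above:

    # Sort by prefecture code rather than pronunciation.
    PREF_CODE = {"北海道": 1, "青森県": 2, "東京都": 13, "大阪府": 27, "沖縄県": 47}

    prefs = ["大阪府", "沖縄県", "北海道", "東京都", "青森県"]
    print(sorted(prefs, key=PREF_CODE.get))
    # -> ['北海道', '青森県', '東京都', '大阪府', '沖縄県']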

Localization best practices

I'm starting to modify my app, which uses all hardcoded strings for errors, GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
- Put all strings in resource (.rc) files.
- Define all strings in a file, once for each language, and use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally preferred?
Put all the strings in resource files. Once you've done that, there are several good translation packages available. One useful thing these packages do is allow you to get translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there are a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used-book service; most of the more modern material seems to be about internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale-specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
There's another approach of just putting strings in the source with something like tr(" ") and using one of the tools that strips them out and converts them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or db keys).
It makes the source easier to read and search, instead of having to look up what IDS_MESSAGE34 means.
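In Python terms this is the standard gettext pattern; a minimal sketch (xgettext or pygettext would then extract the marked strings into a .pot catalog):

    # _() marks a string as translatable; unmarked strings stay as-is.
    import gettext

    _ = gettext.gettext  # no catalog bound here, so _() passes through

    print(_("File not found"))   # marked: extracted and translated
    STATUS_KEY = "file_missing"  # not marked: protocol/db key, untouched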
One problem with resource files, at least with Windows/MFC, is that you can't use the string table in dialogs. So you have some text in the string table and some in the dialog section, which you have to deal with separately.