Finding natural English text strings in codebase - internationalization

I'm in the process of implementing I18n in a Rails project, which involves going through a vast number of files, looking for hard-coded English text, and replacing it with calls to the Translation table. So for example if I see this
<h1 class="mb0">My Videolink Sessions</h1>
I replace it with
<h1 class="mb0"><%= t('interface.video_meetings.my_videolink_sessions') %></h1>
and make sure "My Videolink Sessions" goes into the appropriate place in the English translation table.
I've done most of it, but the remainder is inevitably going to be a PITA to track down. It occurred to me that someone might have made a tool that can scan a folder for natural English text while ignoring code and comments.
For example, if it were to scan a folder of html.erb files and only consider the contents of HTML tags (i.e. the part between the > and the </), that would be a pretty good start.
Is anyone aware of such a thing? The tool itself doesn't have to be in Ruby.
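To illustrate the kind of heuristic scan I have in mind, here is a rough Ruby sketch (the view path and the "looks like English" test are only guesses to be tuned, not a finished tool):

# Rough heuristic: print text that appears between ">" and "<" in .html.erb files,
# skipping anything that doesn't look like at least a two-word English phrase.
Dir.glob("app/views/**/*.html.erb").each do |path|
  File.read(path).scan(/>([^<>]+)</).each do |capture|
    text = capture.first.strip
    next unless text =~ /[A-Za-z]{2,}\s+[A-Za-z]{2,}/   # crude "looks like English" test
    puts "#{path}: #{text}"
  end
end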

Related

Microdata for a dictionary: can I use Yandex?

I want to use microdata/microformats/etc. for the part of my website that is an online dictionary. Basically I just want to tag words and definitions to help search engines grab the most important data on every page belonging to the dictionary, and maybe have Google use them as "rich snippets" in the results page.
The main problem is that it's hard to find a dedicated vocabulary for words and definitions (no problem for recipes, movies and hotels, though), and I'm not sure whether I have to use the http://schema.org/Article tree for my lexicographic work. (To my mind, it makes sense to tag something when it's specific enough.)
I have found something interesting at Yandex for words and encyclopedia entries, and I want to ask what to do with it. See here:
https://yandex.ru/support/webmaster/microdata/what-is-microdata.xml?lang=en
https://yandex.com/support/webmaster/microdata/term-definition-markup.xml
It looks very close to what I need. But I'm sorry, I don't know what Yandex is... will it work with Google?
I'm asking whether that page from Yandex describes a working model, whether it is still in use, and what the pros and cons are. Will Google be able to use Yandex's specific vocabulary and understand my Yandex-tagged data? Is it worth using that vocabulary for an online dictionary, or is there something else I have missed that would be of better use?
(http://webmaster.yandex.ru/vocabularies/term-def.xml, which should be the vocabulary URL, gives me a 404.)
One more question, please: am I allowed to write (duplicate) the most important data in the header, something like this? (I believe I am, because Google's microdata testing tool proves able to extract the data from that code.)
<html itemscope itemtype="http://webmaster.yandex.ru/vocabularies/term-def.xml">
<meta itemprop="term" content="My term" />
<meta itemprop="definition" content="My definition" />
Just to mention that I was interested in, though not satisfied with, these closely related discussions:
https://webmasters.stackexchange.com/questions/55073/what-meta-tag-or-structured-data-should-i-use-for-a-dictionary-web-application
schema.org and an online dictionary
Yandex is Russia's version of Google, and typically they both recognize and honor each other's search engine result implementations.
These articles you are referencing are incredibly outdated; I recommend seeking out fresher sources, preferably ones where the term being defined uses the proper HTML element.
For the Yandex URL that is 404ing, the Wayback Machine is your friend!
Back to fresher documentation/resources: in this case the correct element as of 2016-10-05 is the <dfn> element. I know you want added semantics, but semantics is the proper place to start. I'd follow that up by marking the entire dictionary up within a definition list (<dl>) element, placing the term wrapped in <dfn> into the <dt>, and the definitions of that term in the corresponding <dd>s.
I wouldn't waste time trying to find the perfect ontology here; implement the rel="tag" microformat on all of the definitions, and you can always come back and add a more desirable one later.
I've written a blog post about this, but a much more valuable resource is HTML5 Doctor's Glossary implementation. More importantly, view the source: view-source:http://html5doctor.com/element-index/ (why Stack Overflow doesn't recognize the 'view-source' scheme is beyond me).
More References/Resources:
Microformats Definition Examples has some very interesting ideas/code snippets
Utilizing the Underused but Semantically Awesome Definition List - Written Prior to HTML5's Redefinition of <dl> but Relevant

Internationalizing content-heavy pages into message files seems cumbersome in Play 2?

Internationalization in Play 2 can be done with Messages.get("home.title") and language files. But what about when you internationalize a page full of textual content, not just one specific header or link?
For example, writing a message file for a long page representing, e.g., product info:
_First header_
Some paragraphs of text
...
_Tenth header_
Tenth paragraph and more text
Messagefile
a)
product.info = "<many paragraphs of text including headers>"
or splitting one page into html elements
b)
product.info.h1 = "<first header>"
product.info.p1 = "<first para>"
product.info.p2 = "<2nd para>"
To me, neither solution sounds right. In the first, having a vast value for a single key seems like bad convention, and in the latter, splitting a single page into dozens of keys doesn't sound good either.
Big websites often follow the convention www.site.com/en-us/product/1 of having the language in the URL. So the question is, how do I do it this way, and is this a better way at all? I could easily end up not just translating into a dozen languages but also making layout changes a dozen times.
I could use global code snippets backed by the message file for elements that have little text and don't change often, e.g. navigation in /view/global/header/somenavbar.scala.html, but then I only end up with a complex folder structure.
Is there another way, a best practice, for internationalization in Play 2 other than message files?
Take a look at Joscha Feth's solution in the play_authenticate Java sample.
There are templates for emails in 3 languages for email confirmation, password resetting, etc.
The template for each 'type' of email and each language is kept in a single file, i.e.:
_password_reset_en.scala.html
_password_reset_de.scala.html
_password_reset_pl.scala.html
_verify_email_en... etc
And for each 'type' there is a 'parent' template, which contains a condition (a common Scala match; check the Tags section of the template docs) that returns the rendered view depending on the detected language:
password_reset.scala.html
Finally, yes, at the beginning I also thought it was some kind of madness, but believe me, that technique can be useful. There's room for further improvement, I think. Maybe it would be better to move the language condition into the controller; I think that depends on many factors, and it would be great if you find the time to investigate this topic.

Parsing HTML in AppleScript

What's a good way to parse HTML in AppleScript?
I haven't dabbled in AppleScript in quite some time, and even when I did it was very minimal and uninvolved, so I don't really think naturally in the language quite yet. But I need to do some string manipulation and parse some HTML (basically some simple screen scraping).
Naturally, I'd like to avoid common pitfalls of HTML parsing. However, this is a temporary script and doesn't need to be particularly robust or supportable. I really just need to scrape specific substrings (from a known starting substring to the next known character) into a file.
I've done plenty of string manipulation in C# and similar languages, but AppleScript is an interesting change of pace to say the least. Can somebody point me to some good resources (Google searches on this subject seem to have a high noise-to-signal ratio), or help me out with some sample code snippets?
The ultimate goal of what I'm doing is to take a pre-determined list of pages, open each one in Safari (I'm doing everything through tell application "Safari"), parse out links which fit a certain pattern, and store all of those links in a file. Then go through that file, open each of those links, parse out more links which fit another pattern, and store all of those links in a file.
(The site is actually owned by someone we're working with, so don't worry about me violating any terms of service or anything like that. But for reasons outside the scope of this question, I'm doing some page scraping in AppleScript.)
I can't say enough good things about Matt Neuburg's AppleScript: the Definitive Guide. Without a doubt the most complete documentation of AppleScript ever done. Matt's also one of my favorite tech writers.
I would also check out this article. It contains a tutorial on how to do this; the example provided there parses HTML data from only one source, but I think it's worth looking at.

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu-on-jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。 (Thank you in advance.)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
And you may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
and there are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
and here somebody tested Ruby with MeCab using the tagger option -Oyomi:
http://hirai2.blog129.fc2.com/blog-entry-4.html
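Roughly, with those Ruby bindings it looks something like this (an untested sketch; check the bindings page above for the exact API):

require 'MeCab'  # the mecab-ruby bindings linked above

tagger = MeCab::Tagger.new("-Oyomi")   # same output format as the command-line -Oyomi
puts tagger.parse("東京都中央区")       # => the reading in katakana, e.g. トウキョウトチュウオウク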
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab--this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a Ruby script (sketched in condensed form just after this list) that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
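For anyone curious, here is a condensed sketch of that script (the file names, the two-column layout, and the final sort are illustrative assumptions; the real script handles more columns):

# Condensed sketch of the steps above. Assumes mecab is on the PATH and the
# input CSV has just NAME and ADDRESS columns, encoded as UTF-8.
require 'csv'
require 'nkf'

system("mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv")

original  = CSV.read("seed_hospitals.csv")
converted = CSV.read("seed_hospitals.converted.csv")

rows = original.zip(converted).map do |orig_row, conv_row|
  name, address = orig_row
  # Normalize MeCab's full-width katakana to hiragana (-h1 = to hiragana, -w = UTF-8 out)
  name_yomi, address_yomi = conv_row.map { |field| NKF.nkf("-h1 -w", field.to_s) }
  [name, name_yomi, address, address_yomi]
end

CSV.open("seed_hospitals.merged.csv", "w") do |csv|
  csv << %w[NAME NAME_YOMIGANA ADDRESS ADDRESS_YOMIGANA]
  rows.sort_by { |r| r[1] }.each { |r| csv << r }  # hiragana compares in roughly gojuu-on order
end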
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
So, I'll introduce another method.
If your app is written in Microsoft VBA, you can call the GetPhonetic function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to prefectures sorted by the 「都道府県コード」 (prefecture code).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" and "ISO 3166-2:JP".
See (Japanese Wikipedia):
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
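For example, in Ruby that could be as simple as a lookup table keyed by name (only the codes quoted above are filled in here; prefecture_names is a made-up variable):

# Sort prefectures by JIS X 0401 code rather than by reading.
PREFECTURE_CODE = {
  "北海道" => 1, "青森県" => 2, "東京都" => 13, "大阪府" => 27, "沖縄県" => 47
  # ... the remaining prefectures ...
}

sorted = prefecture_names.sort_by { |name| PREFECTURE_CODE.fetch(name, 99) }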

Why do people use plain English as translation placeholders?

This may be a stupid question, but here goes.
I've seen several projects using some translation library (e.g. gettext) working with plain English placeholders. So, for example:
_("Please enter your name");
instead of abstract placeholders (which has always been my instinctive preference)
_("error_please_enter_name");
I have seen various recommendations on SO to work with the former method, but I don't understand why. What I don't get is: what do you do if you need to change the English wording? Because if the actual text is used as the key for all existing translations, you would have to edit all the translations too, and change each key. Or don't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language is already existent: you don't need to translate it.
Translators have better context with a real sentence than vague placeholders.
The placeholders are just the keys; it's still possible to change the original language by creating a translation for it, because when a translation doesn't exist, the placeholder is used as the translated text.
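A minimal sketch of that fallback behaviour (a toy lookup, not any particular library's real API):

# Toy gettext-style lookup: fall back to the English msgid when no translation exists.
TRANSLATIONS = {
  "de" => { "Please enter your name" => "Bitte geben Sie Ihren Namen ein" }
}

def _(msgid, locale = "fr")
  TRANSLATIONS.dig(locale, msgid) || msgid  # no French entry -> the English source text shows through
end

_("Please enter your name")  # => "Please enter your name"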
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always have the problem of homonyms. For example, 'open' can mean the state of a window but also the verb that performs the action. In other languages these homonyms may not exist. That's why you should be able to add meaning to your placeholders. The best approach is to put this meaning in your text library. If this is not possible on the platform or framework you use, it might be a good idea to define a 'development language'. This language will add meaning to the text entries, like 'action_open' and 'state_open'. You will of course have to put extra effort into translating this language into plain English (or the language you develop for). I have applied this philosophy in some large projects, and in the long run it saves some time (and headaches).
The best way, in my opinion, is to keep the meaning separate, so if you develop your own translation library, or the one you use supports it, you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
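A toy sketch of what that could look like under the hood (real libraries offer a similar idea, e.g. message contexts in gettext), reusing the 'open' homonym from the answer above:

# Key the catalogue on [meaning, text] so homonyms can be translated differently.
CATALOGUE = {
  "de" => {
    ["state_open",  "Open"] => "Offen",   # the window is open
    ["action_open", "Open"] => "Öffnen"   # the menu command to open something
  }
}

def i18n(text, meaning, locale = "de")
  CATALOGUE.dig(locale, [meaning, text]) || text
end

i18n("Open", "action_open")  # => "Öffnen"
i18n("Open", "state_open")   # => "Offen"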
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is best for scalability, but it does just require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
fr_FR
fr
Unlocalised. Source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided for in step (1) or (2), you will at least get a proper, understandable sentence rather than random programmer garbage like “error_file_not_found”.
Plus, what do you do if it is a format string: “Sorry but the %s does not exist” ? Worse still: “Written %s entries to %s, total size: %d” ?
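To make that concrete, with a toy _() that simply falls back to its argument:

def _(msgid) msgid end  # pretend no translation was found for this locale

format(_("Written %s entries to %s, total size: %d"), 42, "backup.zip", 1_048_576)
# => "Written 42 entries to backup.zip, total size: 1048576"
# With an abstract key, the fallback would literally be the key string and the values would be lost.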
Quite old question but one additional reason I haven't seen in the answers yet:
You could end up with more placeholders than necessary, and thus more work for translators and possibly inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (which the developer is most likely not aware of and shouldn't need to be), e.g. it could be used not as an error but as a prompt, such as the placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable; the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely according to a predefined wording scheme to not end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes, e.g. for contexts, can never be as precise as the text itself, I think (either too verbose or too general; beforehand you don't know, and afterwards it's difficult to change), and it's more work for the developer than it's worth, IMHO.
Does anybody agree/disagree?
