Captcha for Japanese and Chinese? - internationalization

Normally I use Recaptcha for all captcha purposes, but now I'm building a website that is translated into Chinese and Japanese, among other languages. I'd like to make the captcha as accessible to those users as possible. Even if they can read and type English characters (which is not necessarily the case), oftentimes even I, as an English speaker, have had trouble figuring out what the Recaptcha word is supposed to be.
One good solution I've seen (from Google) is to use numbers instead of text. Are there other good solutions? Is there a reliable free captcha service out there such as Recaptcha that offers this option?

Chinese and Japanese users both use keyboards with Latin characters on them. Chinese users input their thousands of characters via Pinyin (romanized Chinese), so they are very familiar with the same letters that you and I are. Therefore, whatever you are using for English-speaking people can also be used for them.
PS - I know this is an answer to an old post, but I'm hoping this answer will help anyone who comes here with the same question.

I have encountered the same problem in the past; I resolved it by using the following CAPTCHA, which uses numerical validation:
http://www.tipstricks.org/
However, this may not be the best solution for you, so here is an extensive list of different CAPTCHAs you might want to consider (most of them are text based, but some use alternative methods such as numerical expressions):
http://captcha.org/
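If you would rather roll your own numeric check than rely on an external service, the underlying idea is simple enough to sketch. The following is a hypothetical illustration in Ruby (session-style storage and the helper names are assumptions, not code from either site above):

    # Minimal arithmetic-CAPTCHA sketch: show two random digits, keep the
    # expected answer server-side (e.g. in the session), compare on submit.
    def generate_captcha(session)
      a, b = rand(1..9), rand(1..9)
      session[:captcha_answer] = a + b
      "#{a} + #{b} = ?"           # render this prompt next to the answer field
    end

    def captcha_valid?(session, user_input)
      user_input.to_i == session[:captcha_answer]
    end

Digits are script-independent, so this works the same for Chinese and Japanese users as for English speakers.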
Hope this helps

Related

Search using Xapian Omega - with Wild Cards or Regular Expressions

We are comparing different search engines for our research archives, and having browsed the Xapian-Omega documentation, we decided to try it out, since Omega appears to be an appropriate solution with several interesting search options.
We installed Xapian-Omega on a Linux server (Debian 7) and tested the setup with success. However, we are unsure how one can employ, or perhaps even enable, the use of wildcards or regular expressions with Xapian-Omega.
We read that for Xapian one has to enable the wildcard option via the "QueryParser flags". Could someone clarify this, i.e. explain it, or point to a page with an example or two?
We did not see much information or many examples regarding the Omega CGI, and although the latter runs well, wildcard options (such as * as the general wildcard and ? for a single character) do not seem to work as expected by default. They would be useful, even though stemming, substrings, etc. may be functional.
E.g. it would be interesting to be able to employ standard simple wildcard searches with a certain precision, such as:
medic* for medicine, medical, medicament
or ? for single characters.
Can regexps be recognised with Omega?
E.g. sep[ae]r[ae]te(\w+)?
or searching for structured formats such as email addresses, credit card numbers, or certain formula types in research papers.
In an old note from Olly Betts on the dev mailing list regarding this, one suggestion was to grep the index file, but this would defeat the RAD advantage of Omega.
Any examples of searches using Omega with wildcards or regular expressions would be most appreciated ... even a pointer to a page where this topic is well presented, with examples illustrating how to develop advanced searches using Xapian alone, would be most welcome (PHP or Python perhaps).
(We are not concerned for the moment about an eventual substantial increase in the index size or in the time to index the archive.)
You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against the unstemmed terms in the database and then feed the matching terms into the search. Where it becomes difficult is if the regex expands to a lot of terms (e.g. just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
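To make the idea concrete, here is a rough sketch of that term-expansion approach using the xapian-ruby bindings. It illustrates the technique described above rather than an existing Xapian feature; the database path and pattern are made up, and a real version would need to skip prefixed and stemmed ('Z'-prefixed) terms and cap the expansion:

    require 'xapian'

    db      = Xapian::Database.new("/var/lib/omega/data/default")   # example path
    pattern = /\Asep[ae]r[ae]te/    # anchored, so the expansion stays small

    # Run the regex over the terms stored in the database and collect matches.
    matching = db.allterms.map(&:term).grep(pattern)
    abort "no matching terms" if matching.empty?

    # OR the expanded terms together into a single query.
    query = matching.map { |t| Xapian::Query.new(t) }
                    .inject { |a, b| Xapian::Query.new(Xapian::Query::OP_OR, a, b) }

    enquire = Xapian::Enquire.new(db)
    enquire.query = query
    enquire.mset(0, 10).matches.each do |m|
      puts "#{m.rank + 1}: #{m.document.data}"
    end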
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.
James Ayatt: Thank you for your answer and help, and my apologies for this belated reply; I was distracted by other work.
We had already seen the Omegascript page, but it was not clear to us how to employ these options with the CGI interface. Also, the use of * seems to be for trailing characters, is that correct? I.e. not for internal groups of letters, e.g. omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wildcard characters, sometimes represented by ? in certain search engines. Could you comment here?
Regarding the use of regular expressions, we had imagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses; there are many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extraction of such text, perhaps together with some surrounding text, could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.

Is there a standard computer vocabulary for German? for Spanish?

I was given the task of coming up with shorter German words for the German version of our software.
It got me thinking that there should be some sort of standard vocabulary for information technology somewhere. There "have to be" terms that most (if not all) German computer users use for what English speakers call file, database, record, search, search terms, search hits, find and replace, delete, OCR ... you get the idea.
I found ISO 2382 on the ISO Web site, but it only seems to standardize English and French. Is there an equivalent standard for German? How about for Spanish, or for other languages?
May I suggest this book; although quite dated, it was an attempt to come up with a set of standard computer terms for translating between German and English:
Grosses IWT-Wörterbuch der Computertechnik und der Wirtschaftsinformatik. Englisch-Deutsch. Deutsch-Englisch
I will offer up the answer, "no".
Even within English, there are not standard words to describe computer operations as you have presented them. Certainly one can "delete" a file, but one can also "erase" it, "remove" it, and (shudder) "move it to the trash can".
Instead of trying to solve the problem in the large, I suggest you solve the problem in the small. Build a glossary of commonly used German words, and whenever there is an opportunity to expand the glossary, first look over the existing entries and do your best to reuse the current terminology.
In a way, the reason good English documentation works well is that good writers of English use a glossary-like technique, explicitly or implicitly. In the event that much of your documentation comes from a single source, or a related set of sources, you can make a "translation map" of "when they say X, we say Y". But even such simplifications often require native readers to re-read the translation in context, as languages are not nearly regular enough to do simple substitution without many pitfalls.
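As a trivial illustration of such a "when they say X, we say Y" map, the glossary can literally be a lookup table that the team extends over time. A hypothetical sketch in Ruby (the entries are just examples):

    # Hypothetical English-to-German glossary: one canonical term per concept,
    # reused everywhere instead of improvising a new translation each time.
    GLOSSARY = {
      "file"     => "Datei",
      "database" => "Datenbank",
      "record"   => "Datensatz",
      "delete"   => "löschen",
    }.freeze

    def translate_term(english_term)
      GLOSSARY.fetch(english_term.downcase) do
        warn "No glossary entry for #{english_term.inspect}; add one before improvising."
        english_term
      end
    end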
As a starting point, The Open Group (www.opengroup.org) seems to have defined glossaries as part of their work on The Open Group Architecture Framework (TOGAF), which appear to be the sort of thing I needed. For example, these document numbers and titles are taken directly from their Web site:
C148 TOGAF® 9.1 Translation Glossary: English – Hrvatski (Croatian)
C149 TOGAF® 9.1 Translation Glossary: English – Castilian Spanish
C146 TOGAF® 9.1 Translation Glossary: English – Portuguese (Portugal)
C13H TOGAF® 9.1 Translation Glossary: English – Slovak

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and so on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。 (Thank you in advance.)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is a lot of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
There are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
And here is someone's test of Ruby with MeCab using the tagger option -Oyomi:
http://hirai2.blog129.fc2.com/blog-entry-4.html
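For reference, the -Oyomi approach from the links above boils down to very little code. A minimal sketch, assuming the official SWIG-based MeCab Ruby bindings are installed (the input string is just an example):

    require 'MeCab'

    # -Oyomi makes MeCab output the reading of the input in full-width katakana.
    tagger = MeCab::Tagger.new("-Oyomi")
    puts tagger.parse("東京都中央区")   # => トウキョウトチュウオウク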
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab; this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a Ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combines the input file with the mecab-translated file, so that the resulting file has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on (a rough sketch of this step appears below).
Use the data from the resulting .csv file to seed our rails app with its built-in values.
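A rough sketch of the combine-and-normalize steps; the file names and column layout are simplified for illustration, and the real seed files have more columns:

    require 'csv'
    require 'nkf'

    originals = CSV.read("seed_hospitals.csv")            # kanji names/addresses from the client
    readings  = CSV.read("seed_hospitals.converted.csv")  # katakana output of mecab -Oyomi

    CSV.open("seed_hospitals.with_yomigana.csv", "w") do |out|
      originals.zip(readings).each do |(name, address), (name_yomi, address_yomi)|
        # Normalize mecab's katakana to hiragana so it sorts together with
        # user-entered yomigana (-h1 = to hiragana, -w = UTF-8 output).
        out << [name, NKF.nkf("-h1 -w", name_yomi),
                address, NKF.nkf("-h1 -w", address_yomi)]
      end
    end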
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
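Once each name has its reading attached, step three really is just a sort on that reading. A small sketch (the names and readings are illustrative); plain hiragana compares in roughly gojūon order under Ruby's default string comparison, though a proper collation library handles voiced and small kana more precisely:

    hospitals = [
      { name: "高谷クリニック", yomi: "たかやくりにっく" },
      { name: "高橋医院", yomi: "たかはしいいん" },
    ]
    # Sort by the hiragana reading, not by the kanji themselves.
    hospitals.sort_by { |h| h[:yomi] }.each { |h| puts h[:name] }
    # => 高橋医院, then 高谷クリニック (はし sorts before や)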
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
Still, I'll introduce another method.
If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to prefectures being sorted by their 「都道府県コード」 (prefecture code).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" and "ISO 3166-2:JP".
See (Japanese Wikipedia):
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
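In code, that just means sorting by a fixed code table instead of by reading. A minimal sketch (only a few of the 47 codes are shown):

    # JIS X 0401 prefecture codes (excerpt).
    PREFECTURE_CODES = {
      "北海道" => 1, "青森県" => 2, "東京都" => 13, "大阪府" => 27, "沖縄県" => 47,
    }.freeze

    prefectures = ["大阪府", "北海道", "東京都"]
    puts prefectures.sort_by { |p| PREFECTURE_CODES.fetch(p) }
    # => 北海道, 東京都, 大阪府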

Why are non-ascii characters still disliked in function / var names?

When I see a small program written for students, I often see something like this (Haskell, German):
ueber = "What the haeck!"
instead of
über = "What the häck!"
As many modern languages are specified to allow non-ASCII characters in declaration names via UTF-8, is there a special reason for avoiding these in a project which is sure to be used only by people who are able to input these characters (say, a team of German students), or is this just a historical reason?
I know that you should keep names in a-zA-Z_0-9 if you develop an application internationally, but is there any reason for avoiding this in a "local" project?
is there a special reason for avoiding these in a project, which is sure to be only for people who are able to input these characters
That is certainly the main reason. Other reasons that come to mind are that many development tools, search functions, editors, parsers, documentation generators, code search engines, etc. will not expect non-ASCII input in code.
Also, you never know where your code may be used one day! The smallest innocent school project can grow into a nice open-source tool that gets used around the globe. In that case, ASCII is the lowest common denominator, at least at the moment.
I've had to work on a project started by French developers. They had to spend quite a bit of time translating their program to English when more people joined the project. Teach your German students this lesson up front, and not only will they be able to share their code with others, they'll no longer need an über or ueber variable either.
BTW, ü is an alphabetic character. + and - are non-alphanumeric, and I'd say it's obvious why they're disliked in function names.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example, Google will recognize a home or company address in an email and offer a map to that address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper (and not just the text from the page), I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
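Here is the low-hanging-fruit part sketched with Nokogiri, a Ruby HTML parser comparable to Beautiful Soup (the sample HTML is made up):

    require 'nokogiri'

    html = <<~HTML
      <html><body>
        <p>Contact us:</p>
        <address>1600 Amphitheatre Parkway, Mountain View, CA 94043</address>
        <div class="adr">
          <span class="street-address">221B Baker Street</span>,
          <span class="locality">London</span>
        </div>
      </body></html>
    HTML

    doc = Nokogiri::HTML(html)
    # Explicit <address> elements and adr-microformat containers first.
    candidates = doc.css("address, .adr").map { |node| node.text.gsub(/\s+/, " ").strip }
    puts candidates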
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions to parse HTML. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. Even if you do use a regular expression, apply it only to the HTML elements BeautifulSoup grabs.
If you do it with your own regexes, you not only have to worry about finding the data you require, you have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job of handling all but the simplest URLs.
If you want to go down the regexp route your best bet is probably to check out the sourcecode of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess whether a string is an address or not with an expression like "(number), (name) Street|Boulevard|Main", etc.
You can also consider looking into some Firefox extensions which aim to map addresses found in text, to see how they work.
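As a concrete example of that kind of guess, here is a deliberately naive US-style street-address pattern in Ruby; it will both miss real addresses and produce false positives, so treat it only as a candidate filter:

    # number + one-to-three capitalized name words + a street-type keyword
    STREET_RE = /\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:Street|St\.?|Avenue|Ave\.?|Boulevard|Blvd\.?|Road|Rd\.?)\b/

    text = "Visit us at 350 Fifth Avenue, New York, or write to info@example.com."
    puts text.scan(STREET_RE)   # => 350 Fifth Avenue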
You can check this USA extraction example http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends upon your requirements.
For email and contact details, a regex is more than enough.
For addresses, a regex alone will not help; think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information like paragraphs, get the contents by using tags.
