Text whitespace corrector

What I basically need is whitespace correction (anything beyond that would be appreciated too!), for example: I am in a park ( with her , not him ) . => I am in a park (with her, not him).
Whitespace correction could be done with regular expressions, but I need all the language-specific rules (it would be nice to have that sorted out in a library!). Specifically, I need to do this for French text, where the punctuation spacing rules differ from English.
I don't know whether NLTK (Python) can help with this.

You could ask native speakers for more help, but a good, complete French grammar book might suffice.
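For what it's worth, the regex route from the question can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the function name fix_spacing and the lang parameter are made up for illustration, the English rules cover just brackets, commas, full stops and ; : ! ?, and the French branch encodes a single rule (a non-breaking space before ; : ! ?). A real French rule set is larger (guillemets, narrow spaces, times such as 10:30, URLs), so treat this as a starting point rather than a complete corrector.

```python
import re

def fix_spacing(text: str, lang: str = "en") -> str:
    """Normalize spacing around punctuation (illustrative, not a full rule set)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # No space just inside brackets: "( with" -> "(with", "him )" -> "him)".
    text = re.sub(r"([(\[]) ", r"\1", text)
    text = re.sub(r" ([)\]])", r"\1", text)
    # No space before a comma or a full stop (same in English and French).
    text = re.sub(r" ([,.])", r"\1", text)
    if lang == "fr":
        # Assumed French rule: exactly one non-breaking space before ; : ! ?
        text = re.sub(r"[ \u00a0]*([;:!?])", "\u00a0\\1", text)
    else:
        # English: no space before ; : ! ?
        text = re.sub(r" ([;:!?])", r"\1", text)
    return text.strip()

print(fix_spacing("I am in a park ( with her , not him ) ."))
print(repr(fix_spacing("Comment ça va ?", lang="fr")))
```

Keeping the rules in one function per language (or a table of pattern/replacement pairs keyed by language code) makes it easy to add the rules a grammar reference gives you one at a time.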

Related

Translating for websites - English to Chinese (Simplified) - Characters or written?

My question is this:
Should Chinese characters be used or should the translations be spelled out?
delete > 删除
or
delete > Shānchú
You should test your translated website with users of the destination language, and use the terms and spelling which they prefer. You should have a translator who is skilled in the source and destination language perform the translation, and accept the terms and spelling which they suggest.
I am not a user of websites in Chinese, nor a translator. But from what I understand of the language, Chinese should certainly be written in characters (e.g. 删除). The Latin script spelling (Shānchú) is used by those who do not read the Chinese script, as a way to understand the sound of the words.

How to implement fuzzy search for Chinese pinyin and Japanese romaji?

I have some data in Chinese and Japanese, and I want to make it searchable by romanization (Pinyin for Chinese, Romaji for Japanese). Assume that the romanizations are already provided, split into syllables.
E.g. the text "示例文本", which romanizes to ["shi", "li", "wen", "ben"].
Users should be able to match this by typing:
whole syllables, with or without spaces, e.g. shi li wen ben or shiliwenben
initials or the first few letters of each syllable, e.g. shlwb or slwb
only part of the string, e.g. wenben or wb (these examples correspond to the last two syllables of the text above).
Is there an elegant way of implementing this?
(note: I did not specify any programming language in this question, because I want to implement this in different languages. If your response is language-specific or requires specific libraries, please make it clear. Thank you!)
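One possible approach, sketched here in Python purely for illustration (the question is language-neutral, and the function name matches is an assumption): strip the spaces from the query, then check whether it can be split into non-empty prefixes of consecutive syllables, starting at any syllable. A small backtracking search is enough for that, and it uses no libraries, so it should port to other languages easily.

```python
def matches(syllables, query):
    """True if query is a run of non-empty prefixes of consecutive syllables."""
    q = query.replace(" ", "").lower()
    if not q:
        return False

    def match_from(i, pos):
        # Can q[pos:] be consumed by prefixes of syllables[i], syllables[i+1], ...?
        if pos == len(q):
            return True   # whole query consumed
        if i == len(syllables):
            return False  # ran out of syllables with query text left over
        syl = syllables[i]
        # Try every non-empty prefix of the current syllable.
        for k in range(1, len(syl) + 1):
            if q.startswith(syl[:k], pos) and match_from(i + 1, pos + k):
                return True
        return False

    # The match may start at any syllable, which allows partial queries.
    return any(match_from(start, 0) for start in range(len(syllables)))

text = ["shi", "li", "wen", "ben"]
for query in ["shi li wen ben", "shiliwenben", "shlwb", "slwb", "wenben", "wb", "sb"]:
    print(query, matches(text, query))
```

The search is exponential in the worst case, but syllables are short, so this should be fine for typical titles or names. For large data sets it would probably need an index on top (for example on syllable initials) rather than a scan of every record.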

Captcha for Japanese and Chinese?

Normally I use Recaptcha for all captcha purposes, but now I'm building a website that is translated into Chinese and Japanese, among other languages. I'd like to make the captcha as accessible to those users as possible. Even if they can read and type English characters (which is not necessarily the case), often times even I as an English-speaker have had trouble figuring out what the word in Recaptcha has to be.
One good solution I've seen (from Google) is to use numbers instead of text. Are there other good solutions? Is there a reliable free captcha service out there such as Recaptcha that offers this option?
Both Chinese and Japanese users type on keyboards with Latin characters. Chinese users enter their thousands of characters via Pinyin (romanized Chinese), so they are very familiar with the same letters that you and I are. Therefore, whatever you are using for English-speaking users can also be used for them.
PS - I know this is an answer to an old post, but I'm hoping this answer will help anyone who comes here with the same question.
I have encountered the same problem in the past. I resolved it by using the following CAPTCHA, which uses numerical validation:
http://www.tipstricks.org/
However, this may not be the best solution for you, so here is an extensive list of different CAPTCHAs you might want to consider (most of them are text based, but some use alternative methods such as numerical expressions):
http://captcha.org/
Hope this helps

Displaying language lists: Which language should I use? [closed]

Every once in a while I have to display a list of available languages, and each and every time I ask myself:
Is it better to display each language name in:
the currently selected language
English
the language that the button/list item refers to
Examples:
English
German
French
or
English
Deutsch
Français
Is there any convention on which one should be used, is more polite or better in any other way? Are there other options?
I would say it's best to display the language in "its own language" (option #3). You can not necessarily expect the user to know the currently selected language, nor expect him to know English.
What's tricky is how to display the "Select your language" button in a language-neutral way. I usually go for a flag indicating the current language, since that tends to get the message across even though there's not always a 1:1 mapping between countries and languages.
I definitely think you should display in the language that matches the item in the button list.
Reasons:
If it's not the language you're interested in, you won't mind if you don't understand it, as long as you can find your own language.
Think about the last time you called customer service. How many times have you heard something like, "Para Espanol, marque dos"? It's very common, accepted practice to mix different languages in one UI (whether visual or audible).
Think about how you'd feel if you went to a Spanish site, and you couldn't find your language under "E". Maybe, eventually, you'd notice "Ingles", and think it probably translated to "English", but it's definitely better to save the user the trouble of translating and mentally alphabetizing.
The standard (in both senses of the word, i.e. what is actually used in the real world, and what the IETF/W3C/ISO says) is to use ISO 639-1 two-letter language codes. These can be augmented with the full name of the language in English, its name in the language itself, a romanized transliteration of that name, or any combination thereof.
So, to keep with your example:
[de] German - Deutsch
[en] English
[fr] French - Français
[ja] Japanese - 日本語 (Nihongo)
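A minimal sketch of how such labels could be generated, in Python. The LANGUAGES table and the function names label and language_menu are made up for illustration; only the four entries from the example above are included.

```python
# ISO 639-1 code -> (English name, native name, optional romanization).
LANGUAGES = {
    "de": ("German", "Deutsch", None),
    "en": ("English", "English", None),
    "fr": ("French", "Français", None),
    "ja": ("Japanese", "日本語", "Nihongo"),
}

def label(code):
    """Build a label such as '[ja] Japanese - 日本語 (Nihongo)'."""
    english, native, romanized = LANGUAGES[code]
    text = f"[{code}] {english}"
    if native != english:
        text += f" - {native}"      # skip the native name when it would repeat
    if romanized:
        text += f" ({romanized})"
    return text

def language_menu():
    # Sorting by code gives a predictable, language-neutral order.
    return [label(code) for code in sorted(LANGUAGES)]

for line in language_menu():
    print(line)
```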
There are two options: put the name of the language in the selected locale (or in English) first, followed by the language's own name in parentheses, or do it the other way around, e.g.:
English
French (Français)
German (Deutsch)
Spanish (Español)
or
English
Français (French)
Deutsch (German)
Español (Spanish)
English language name:
Pros:
Predictable sorting.
No need to think about different text flows.
Cons:
Users who don't speak English might have a harder time finding their language.
If the rest of the application is translated, it might look sloppy or grammatically wrong: Ditt språk är English/votre langue est English.
Language in its own name:
Pros:
Easier for the non-English speaker.
You have to think about encoding and text flow; A useful exercise. :-)
Cons:
Harder to navigate if the user is used to English or has her mind set on finding an English name.
You have to consider all language variants.
What is right really depends on the rest of your application. You might want to consider having all language names translated into all languages. If English is chosen, then you get to pick from:
English
Swedish
French
If Swedish:
Engelska
Svenska
Franska
...and French:
Anglais
Suédois
Français
But then the translation problem has grown from O(n) to O(n^2), which might be acceptable depending on what your current value of n is.
EDIT
As deceze points out, you will also have to handle the case where a user accidentally switches to a language she doesn't understand, and provide a way back, for example by always including a few major languages.
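To make the O(n^2) point concrete, here is a small Python sketch using the three languages from the example. The NAMES table and the helper names menu and translation_count are hypothetical; the point is only that every UI language needs its own row of n names, so n languages need n * n strings.

```python
# NAMES[ui_language][language] -> name of `language` as shown in `ui_language`.
NAMES = {
    "en": {"en": "English", "sv": "Swedish", "fr": "French"},
    "sv": {"en": "Engelska", "sv": "Svenska", "fr": "Franska"},
    "fr": {"en": "Anglais", "sv": "Suédois", "fr": "Français"},
}

def menu(ui_lang):
    """Language names as they would appear when the UI is in ui_lang."""
    return [NAMES[ui_lang][code] for code in ("en", "sv", "fr")]

def translation_count(names=NAMES):
    """Total number of strings that need translating (n rows of n names)."""
    return sum(len(row) for row in names.values())

print(menu("sv"))
print(translation_count())
```

With 3 languages that is 9 strings; with 30 it would already be 900, which is where showing each language under its own name (n strings) starts to look attractive.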
I find it harder to find "Magyar" in a list of languages.
Because there are languages with non-latin character set, this is not a simple first-letter-lookup, as I lose focus when I first meet one of these.
Where should I look? At 'M' - Magyar? But where is M? EDIT: M in the (current language's) alphabet, not on the keyboard.
Have a look at this (from Wikipedia):
Български - I know, this is Bulgarian, but
བོད་ཡིག - what is this?
Bosanski
Català
Česky
Dansk
Deutsch
Ελληνικά
I would prefer something like this:
A...
B...
C...
.
.
Hungarian (Magyar)
If the UI were in Japanese, I would just Ctrl+F for "Magyar", though.
Whatever you do, don't use the IP location to set the language.
Google is very annoying about this: when logging on from a new location, I get Google in the local language and script. This is particularly annoying anywhere southeast of Croatia.
The worst offender, though, is Microsoft. When you try to purchase software, their servers keep switching languages depending on your location, which in many cases makes it impossible to pay by credit card, because addresses, ZIP codes, etc. are validated in the local format rather than in the format of the country where your credit card was actually issued. (By the way, MS: the first four digits of a credit card number indicate the issuing institution, which is tied to a particular country, so it's not rocket science to work out that a UK postcode format is required rather than, say, a five-digit German ZIP code.)
Use country flags in combination with the language name in that language (Deutsch, Français, Nederlands, ...).
I don't know of any programming-related conventions for this, but I would prefer to see the name of a language in its own language.
For example:
English
Türkçe
Deutsch
Have a look at your Regional Settings.
This is how Microsoft implemented it. Seems like your version 1.

Validating Kana Input

I am working on an application that allows users to input Japanese language characters. I am trying to come up with a way to determine whether the user's input is Japanese text of a particular kind (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but the Unicode ranges for hiragana and katakana are listed on Wikipedia. (I would expect them to be available from unicode.org as well.)
Hiragana: U+3040-U+309F
Katakana: U+30A0-U+30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I wouldn't necessarily expect Simplified Chinese and Traditional Chinese to be included in the same range either...)
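A range check along those lines could look like the following Python sketch (the question doesn't name a language, so Python is just an assumption, as are the function names). The hiragana and katakana ranges are the ones quoted above. For kanji it assumes the main CJK Unified Ideographs block, U+4E00-U+9FFF, which Chinese and Japanese do share, so this test cannot tell Japanese kanji apart from Chinese-only characters, and rarer characters in the extension blocks are not covered.

```python
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
KANJI = (0x4E00, 0x9FFF)  # CJK Unified Ideographs (main block only, shared with Chinese)

def in_block(ch, block):
    """True if the single character ch falls inside the inclusive code point range."""
    low, high = block
    return low <= ord(ch) <= high

def all_in_block(text, block):
    # An empty string is rejected so that blank fields do not validate.
    return bool(text) and all(in_block(ch, block) for ch in text)

def is_hiragana(text):
    return all_in_block(text, HIRAGANA)

def is_katakana(text):
    return all_in_block(text, KATAKANA)

def is_kanji(text):
    return all_in_block(text, KANJI)

print(is_hiragana("ひらがな"), is_katakana("カタカナ"), is_kanji("漢字"))
```

Because Python strings are sequences of code points, the UTF-8 encoding mentioned in the question only matters when decoding the input. The checks themselves work on code points, not bytes.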
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.

Resources