Google Cloud Translation from Russian to English does not work when the Russian text is not written in the Cyrillic alphabet

I am using the v3 Google Cloud Translate API (projects.translateText documentation). I have noticed that when I translate from a language with one alphabet (e.g. Russian) to a language with another alphabet (e.g. English), but the input is written in the second alphabet, the API does not translate the text in the response. Instead it returns the text unchanged, although it successfully detects the source language.
I found this question, which I think may describe the same issue; its answer suggests using the spell check API, which seems disappointing.
Examples
1. Requesting Russian word(s) written in the Cyrillic alphabet to be translated to English.
Request
{
  "targetLanguageCode": "en",
  "contents": ["Привет"]
}
Response
{
  "translations": [
    {
      "translatedText": "hello",
      "detectedLanguageCode": "ru"
    }
  ]
}
This returns as expected. It both detects Russian and translates to English.
2. Requesting Russian word(s) written in the Latin alphabet to be translated to English.
Request
{
  "targetLanguageCode": "en",
  "contents": ["Privet"]
}
Response
{
  "translations": [
    {
      "translatedText": "Privet",
      "detectedLanguageCode": "ru"
    }
  ]
}
This does not return what I would expect. It successfully detects the language as Russian but fails to translate the text to English. When I do the same thing on the Google Translate website, it works as I would expect the API to (see the linked translation example).
I Have Tried
Setting the source language explicitly to "ru" instead of letting it auto-detect (see the request sketched below)
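That is, a request along these lines (a sketch; sourceLanguageCode is the v3 field for pinning the source language):
{
  "sourceLanguageCode": "ru",
  "targetLanguageCode": "en",
  "contents": ["Privet"]
}
The response still came back with the text untranslated.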
Questions
Is this intended behavior?
Is there some other way to get the API to return the translation?

Related

How to differentiate Chinese variants with GetLocaleInfo?

I want to get an ISO 639-1 language string from an LCID. The problem is that 2052 (Simplified Chinese) and 1028 (Traditional Chinese) both return zh (Chinese) instead of zh-CN and zh-TW.
The code I use is
WCHAR locale[8];
GetLocaleInfoW(lcid, LOCALE_SISO639LANGNAME, locale, 8);
Is there a way to get the right code?
ISO 639-1 specifies 2-letter language names, so GetLocaleInfo() correctly returns "zh" for both Simplified and Traditional Chinese - they are not differentiated in the ISO 639-1 spec.
A call with LOCALE_SNAME instead always returns a string that also contains the sub-tag, e.g. "de-DE" or "de-AT".
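A minimal sketch of that call (the loop and output formatting are mine; GetLocaleInfoW with LOCALE_SNAME is the documented API, available on Vista and later):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LCID lcids[] = { 2052, 1028 }; /* Simplified and Traditional Chinese */
    for (int i = 0; i < 2; i++)
    {
        WCHAR name[LOCALE_NAME_MAX_LENGTH];
        /* LOCALE_SNAME yields the full tag, e.g. "zh-CN" or "zh-TW" */
        if (GetLocaleInfoW(lcids[i], LOCALE_SNAME, name, LOCALE_NAME_MAX_LENGTH))
            wprintf(L"%lu -> %ls\n", (unsigned long)lcids[i], name);
    }
    return 0;
}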
Anything else, for example a 2-letter tag for "most" languages and a 4-letter one (xx-YY) for some "exceptions" (like Chinese - and which other ones?), is custom behavior and would therefore require custom code.

Achieving numeric and special-character internationalization in an AngularJS application

How do I implement internationalization in AngularJS?
How can I achieve multi-language support, including numerals (0-9) and special characters like . , # $ & etc., in an AngularJS application?
For example, suppose the user has selected Chinese as the preferred language in the application settings.
Then I need to display numeric data, e.g. $123,23.01, formatted for Chinese.
In $123,23.01 there is a comma after $123, but in Chinese the group and decimal separators (, and .) may be different.
If anyone could share a fiddle link to a working copy, or any clue, it would be a great help.
Thank you :)
I could solve it using JavaScript's native method number.toLocaleString.
var number = 123456.789;
// request a currency format
console.log(number.toLocaleString('de-DE', { style: 'currency', currency: 'EUR' }));
// → 123.456,79 €
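The same approach with a Chinese locale (a sketch; zh-CN and the CNY currency code are my assumptions based on the question):
// request a Chinese (Simplified) currency format
console.log(number.toLocaleString('zh-CN', { style: 'currency', currency: 'CNY' }));
// → ¥123,456.79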

How to search emoticon/emoji in elasticsearch?

I am trying to search for text containing emoticons/emoji in Elasticsearch. I have already inserted tweets into ES. Now I want to search for, say, tweets containing smiling or sad faces. I tried the following:
1) Used the Unicode value of a smiley, but it didn't work. No results were returned.
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\u1f603"
    }
  }
}
How do I set up emoji search in Elasticsearch? Do I have to encode the raw tweets before ingesting them into Elasticsearch? What would the query be? Any experienced approaches? Thanks.
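(As an aside, JSON \u escapes take exactly four hex digits, so "\u1f603" is parsed as U+1F60 followed by a literal 3. A code point above U+FFFF, like U+1F603, has to be written as a UTF-16 surrogate pair. A corrected sketch of the same query, which still depends on how the tweets were analyzed at index time:)
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\ud83d\ude03"
    }
  }
}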
The specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and
finding emoji characters in the target. These are most useful when
they include the annotations as synonyms or hints. For example, when
someone searches for ⛽︎ on yelp.com, they see matches for “gas
station”. Conversely, searching for “gas pump” in a search engine
could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would
expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real Unicode character and expand it to its annotation in each language you aim to support.
This can be done with a synonym filter. But Elasticsearch's standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifiers and clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
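For illustration, a minimal index-settings sketch along those lines (the index, analyzer, and filter names and the synonym entries are my assumptions, not the article's exact configuration):
PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "emoji_synonyms": {
          "type": "synonym",
          "synonyms": [
            "⛽ => gas station, fuel",
            "😃 => smile, happy face"
          ]
        }
      },
      "analyzer": {
        "emoji_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "emoji_synonyms"]
        }
      }
    }
  }
}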
The way I have seen emoticons work is that a string is stored in place of their image counterparts when they are stored in a database. For example, a smile is stored as :smile:. You can verify whether that is the case for you. If so, you can add a custom tokenizer that does not tokenize on colons, so that an exact match for the emoticons can be made (see the sketch below). Then, while searching, you just need to convert the emoticon image in the search to the appropriate string, and Elasticsearch will be able to find it. Hope it helps.
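A sketch of such a setup (the whitespace tokenizer is one simple way to avoid splitting on colons, so a token like :smile: survives analysis; the index name is an assumption):
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "emoticon_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}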

Mac computers aren't processing mailto: links correctly when they have // in them (mailto://)

Sorry for the question title, it's a little difficult to phrase in my opinion. Here is the full question:
The WYSIWYG HTML editor we use on our websites includes a // in the mailto: link when it is inserted into the text editor box (mailto://). We are a web firm and use this editor on many, many websites. For example, all the mail links inserted appear like this:
<a href="mailto://email@domain.com">Text Here</a>
We just noticed this morning that Windows computers do not include the // in the To: field when the link is clicked, regardless of the email client it is opened with. They include the email as normal (email@domain.com).
Mac computers, however, are including the //, so whenever someone tries to send an email using these links, it tries to email //email@domain.com - which isn't delivering, because it is obviously an invalid address with the //.
Does anyone have any insight into why this is happening? The WYSIWYG editor we are using is Obout. If we have to go back and remove these // from every single website we've built, it would be a tremendous task. I'm just wondering why Macs seem to not process the link correctly, while Windows computers do.
The Macs are processing the link correctly. Windows is incorrectly removing data and your editor is incorrectly encoding the data.
The mailto: URL scheme is defined by RFC 2368, which specifies:
mailtoURL = "mailto:" [ to ] [ headers ]
to = #mailbox
headers = "?" header *( "&" header )
header = hname "=" hvalue
hname = *urlc
hvalue = *urlc
"#mailbox" is as specified in RFC 822 [RFC822]. This means that it
consists of zero or more comma-separated mail addresses, possibly
including "phrase" and "comment" components. Note that all URL
reserved characters in "to" must be encoded: in particular,
parentheses, commas, and the percent sign ("%"), which commonly occur
in the "mailbox" syntax.
There is no provision for removing characters such as /.
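In other words, the editor should emit the scheme without slashes. A before/after sketch (the address is illustrative):
<!-- incorrect: standards-following clients treat the "//" as part of the address -->
<a href="mailto://email@domain.com">Email Us</a>
<!-- correct -->
<a href="mailto:email@domain.com">Email Us</a>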

How does Facebook encode emoji in the json Graph API?

Does anyone know how Facebook encodes emoji that require surrogate pairs in the Graph API?
Emoji that fit in the Basic Multilingual Plane seem fine. For example, ❤️ (HEAVY BLACK HEART, though it is rendered red in iOS/OSX) comes through as \u2764\ufe0f, which matches the UTF-16 hex codes / "Formal Unicode Notation" shown at iemoji.com.
And indeed, in Ruby when parsing the JSON output from the API:
ActiveSupport::JSON.decode('"\u2764\ufe0f"')
you correctly get:
"❤️"
However, for another emoji, 💤 (SLEEPING SYMBOL), Facebook returns \udbba\udf59. This seems to correspond to nothing I can find in any Unicode resource, for example this one at iemoji.com.
And when I attempt to decode in Ruby using the same method above:
ActiveSupport::JSON.decode('"\udbba\udf59"')
I get:
"󾭙"
Any idea what's going on here?
Answering my own question, though most of the credit belongs to @bobince for showing me the way in the comments above.
The answer is that Facebook encodes emoji using the "Google" encoding as seen on this Unicode table.
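For reference, decoding that surrogate pair shows why it matched nothing in the standard tables - it lands in a Private Use Area code point. A quick sketch (the arithmetic is just the standard UTF-16 decoding formula):
high = 0xDBBA
low  = 0xDF59
codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
puts format("U+%X", codepoint)
# => U+FEB59, in Supplementary Private Use Area-A, where the legacy "Google" emoji mapping lives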
I have created a ruby gem called emojivert that can convert from one encoding to another, including from "Google" to "Unified". It is based on another existing project called rails-emoji.
So the failing example above would be fixed by doing:
string = ActiveSupport::JSON.decode('"\udbba\udf59"')
> "󾭙"
fixed = Emojivert.google_to_unified(string)
> "💤"
