Translate from a language to English on the console - bash

I work in a company where several languages other than my own (English) are spoken, so I use https://translate.google.com a fair amount. But as I am on the terminal a lot, it would be much more convenient to translate there rather than open a new Google tab. The URL structure is trivial, and this works if put into any browser: https://translate.google.com/?sl=fr&tl=en&text=bonjour&op=translate. Replace fr with any source language, en with any target language, and bonjour with any URL-encoded word or phrase (spaces become %20, e.g. bonjour%20mon%20ami). Ideally, I would like two functions in bash:
tt (translate to), tt <target-lang> <English word or phrase to translate to target-lang>
tf (translate from), tf <source-lang> <word or phrase to translate to English>
I have tried for a few days without success with lynx, elinks, etc., and many searches on commandlinefu and other sites (e.g. https://www.commandlinefu.com/commands/matching/translate-english/dHJhbnNsYXRlIGVuZ2xpc2g=/sort-by-votes), but have not found the trick to returning the translated text. Is Google blocking this somehow, and is there a workaround? Surely some tool (lynx, elinks, links2) can resolve the text sent back when we hit the URL, and then we can extract just the translated text using sed, cut, grep, etc.?
If this is blocked by cookies or some sign-on requirement, are there alternative console tools, or translation services other than Google Translate, that would work?

Various translation services have APIs: Google Translate has one, and so does DeepL. I find some more accurate than others, but this is a matter of personal preference.
https://www.deepl.com/docs-api
https://cloud.google.com/translate/docs/reference/rest/v2/translate
If you want to use one from the shell, it is easy enough to cobble together a small bash script with curl and jq to process the JSON responses, or better, use Python or Perl, which support all these operations natively.
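As a sketch of the curl-plus-jq approach, the two requested functions could look like this. This assumes a DeepL API key exported as DEEPL_AUTH_KEY and the free-tier endpoint; the variable name is my own choice, and you should check the DeepL docs for the endpoint that matches your account tier.

```shell
# Sketch: tt/tf shell functions against the DeepL REST API.
# Assumes a free-tier key in $DEEPL_AUTH_KEY and that curl and jq are installed.

# tt <target-lang> <English phrase>  ->  phrase translated from English
tt() {
  local target; target=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]'); shift
  curl -s https://api-free.deepl.com/v2/translate \
    --data-urlencode "auth_key=${DEEPL_AUTH_KEY}" \
    --data-urlencode "source_lang=EN" \
    --data-urlencode "target_lang=${target}" \
    --data-urlencode "text=$*" |
  jq -r '.translations[0].text'
}

# tf <source-lang> <phrase>  ->  phrase translated into English
tf() {
  local source; source=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]'); shift
  curl -s https://api-free.deepl.com/v2/translate \
    --data-urlencode "auth_key=${DEEPL_AUTH_KEY}" \
    --data-urlencode "source_lang=${source}" \
    --data-urlencode "target_lang=EN" \
    --data-urlencode "text=$*" |
  jq -r '.translations[0].text'
}
```

Usage would then be e.g. tf fr "bonjour mon ami". Note that --data-urlencode takes care of the %20 encoding for you, and jq pulls just the translated text out of the JSON response.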

Related

Using Google's API to split string into words?

I'm trying to figure out which API I should use to get Google to intelligently split a string into words.
Input:
thequickbrownfoxjumpsoverthelazydog
Output:
the quick brown fox jumps over the lazy dog
When I go to Google Translate and input the string (with auto-detect language) and click on the "Listen" icon for Google to read out the string, it breaks up the words and reads it out correctly. So, I know they're able to do it.
But what I can't figure out is if it's the API for Google Translate or their Text-To-Speech API that's breaking up the words. Or if there's any way to get those broken up words in an API response somewhere.
Does anyone have experience using Google's APIs to do this?
AFAIK, there isn't an API in Google Cloud that does that specifically, although it looks like the Translation API does parse the concatenated words in the background when you translate text.
So, since you can't use the same source language as the target language, what you could do is translate to any language and then translate back to the original language. This seems a bit overkill, though.
You could create a Feature Request to ask for such a feature to be implemented in the NLP API for example.
But, depending on your use case, I suppose you could also use the method suggested in this other Stack Overflow answer, which uses dynamic programming to infer the location of spaces in a string without spaces.
Another user even made a pip package named wordninja (See second answer on the same post) based on that.
Install it with pip3 install wordninja.
Example usage:
$ python
>>> import wordninja
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Converting from google translate widget to cloud api using javascript and ajax

We develop and maintain a large number of websites which have used the 'old' translate widget for quite some time. Recently, we've undertaken an effort to make all these sites ADA compliant. As it turns out, the widget's implementation is NOT ADA compliant, and it's being deprecated anyway, so our strategy is to move forward and implement the Cloud Translation API.
Many of the site pages are quite large and contain a lot of markup within the body; the body of most sites' home pages is in the vicinity of 20KB, and other site pages are probably somewhat smaller. So, rather than doing a POST to an endpoint on the server, which would in turn post to the API and then have to return the content to the browser, we believe the correct approach is to access the API directly from the browser. Clearly, if we were to post the HTML content of the body, the API should return the body with the markup intact and the text translated.
The only example we've been able to find shows code with a non-ajax $.get(...) translating a short text string. We're wondering if there might be other examples out there which more closely address what we're trying to accomplish.
One other side note: removing the markup from one of these 20KB bodies reduces it to a bit over 5KB, so doing this could yield significant cost savings for our clients. If we were to do this by creating an array of strings to translate as part of the post, is it possible to instruct the API to do a batch translate, which would allow us to replace the original strings with the translated ones?
Right now, the only available batch request for translations is [1]. It requires the use of Cloud Storage, which holds both the source files and the translated output. From your explanation, I am unsure whether this would be of use to you.
I have found a post [2] with a workaround that may help if it is possible for you to concatenate what needs to be translated: build a single string by concatenating the strings that need translating, then split the translated result on a delimiter value.
[1] https://cloud.google.com/translate/docs/advanced/batch-translation
[2] Bulk translation of a big set of records via google translate
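The mechanics of that delimiter workaround, sketched in shell for illustration (the ' ||| ' delimiter and the function names are arbitrary choices, and the actual API call is elided):

```shell
# Join the strings with a delimiter the translator is unlikely to alter,
# then split the translated result back apart on the same delimiter.
delim=' ||| '

join_segments() {
  local out=$1; shift
  local s
  for s in "$@"; do out+="${delim}${s}"; done
  printf '%s\n' "$out"
}

split_segments() {
  awk -F' \\|\\|\\| ' '{ for (i = 1; i <= NF; i++) print $i }'
}

joined=$(join_segments "Welcome" "About us" "Contact")
# ...send $joined through the translation API here...
split_segments <<<"$joined"    # prints the segments one per line
```

The risk with this scheme, as the linked post notes, is that the translation engine may reorder or mangle the delimiter, so it needs to be something the model treats as opaque.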

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily-available tool that'd eat gettext .po/.pot file(s) and spew out such translation. Still, I think the idea is pretty obvious, so there must be something out there, already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses gettext-compatible library. The only thing the tool should be aware of, is that there could be HTML fragments in translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux cyrillic, you just have to write a program or script that reads Latin and outputs that, and feed that program to msgfilter instead of sed.
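A minimal sketch of such a script in pure bash (the letter map is illustrative and the helper names are made up); the per-character lookup is parameterized so the same function works for any substitution table:

```shell
# map_chars <from> <to>: read stdin and replace each character found in
# <from> with the character at the same position in <to> (like tr, but
# safe for multibyte replacement characters in a UTF-8 locale).
map_chars() {
  local from=$1 to=$2 line out c pre i
  while IFS= read -r line; do
    out=
    for ((i = 0; i < ${#line}; i++)); do
      c=${line:i:1}
      pre=${from%%"$c"*}            # prefix before the first match, if any
      if [ "${#pre}" -lt "${#from}" ]; then
        out+=${to:${#pre}:1}
      else
        out+=$c
      fi
    done
    printf '%s\n' "$out"
  done
}

faux_cyrillic() { map_chars 'AaBEeKkMHOoPpCcTXx' 'АаВЕеКкМНОоРрСсТХх'; }
# e.g.: printf 'Text\n' | faux_cyrillic
# and then, after export -f map_chars faux_cyrillic, something like:
#   msgfilter -i django.po bash -c faux_cyrillic
```

The msgfilter invocation at the end is an assumption about how you would wire it up; msgfilter simply runs the given command once per translation with the msgstr on stdin.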
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
I haven't tried it myself yet, but I found the podebug tool from the Translate Toolkit. Based on the documentation (its flipped and unicode rewrite options), this looks like exactly the tool I wished for.

Captcha for Japanese and Chinese?

Normally I use Recaptcha for all captcha purposes, but now I'm building a website that is translated into Chinese and Japanese, among other languages. I'd like to make the captcha as accessible to those users as possible. Even if they can read and type English characters (which is not necessarily the case), often even I, as an English speaker, have had trouble figuring out what the word in Recaptcha is supposed to be.
One good solution I've seen (from Google) is to use numbers instead of text. Are there other good solutions? Is there a reliable free captcha service out there such as Recaptcha that offers this option?
Chinese and Japanese speakers both use keyboards with Latin characters on them. Chinese users input their thousands of characters via Pinyin (romanized Chinese), so they are very familiar with the same letters that you and I use. Therefore, whatever you are using for English-speaking users can also be used for them.
PS - I know this is an answer to an old post, but I'm hoping this answer will help anyone who comes here with the same question.
I have encountered the same problem in the past, and I resolved it by using the following CAPTCHA, which uses numerical validation:
http://www.tipstricks.org/
However, this may not be the best solution for you, so here is an extensive list of different CAPTCHAs you might want to consider (most of them are text based, but some use alternative methods such as numerical expressions):
http://captcha.org/
Hope this helps

Localization best practices

I'm starting to modify my app, which uses hardcoded strings everywhere for errors, the GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
- Put all strings in resource (.rc) files.
- Define all strings in a file, once for each language, and use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally preferred?
Put all the strings in resource files. Once you've done that, there are several good translation packages available. One useful thing these packages do is allow you to have the translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there's a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used book service; most of the more modern stuff seems to be on internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
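To illustrate the key-plus-fallback idea, here is a toy sketch in shell; the associative arrays stand in for real per-locale resource files, and all the names are invented:

```shell
# Per-locale "resource files" as bash associative arrays; lookup falls
# back to English when a key is missing from the requested locale.
declare -A strings_en=([greeting]='Hello' [farewell]='Goodbye')
declare -A strings_fr=([greeting]='Bonjour')   # deliberately incomplete

lookup() {
  local locale=$1 key=$2
  local ref="strings_${locale}[${key}]"
  local val=${!ref}                 # indirect lookup in the locale table
  printf '%s\n' "${val:-${strings_en[$key]}}"
}

lookup fr greeting    # Bonjour
lookup fr farewell    # key missing in fr, falls back to: Goodbye
```

Real i18n libraries add plural rules, encodings, and locale chains on top of this, but the core mechanism is the same keyed lookup with a fallback.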
There's another approach of just marking strings in the source with something like tr(" ") and using one of the tools that strip them out and convert them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or DB keys).
It makes the source easier to read and search, instead of having to look up what IDS_MESSAGE34 means.
One problem with resource files, at least with Windows/MFC, is that you can't use the string table in dialogs. So you have some text in the string table and some in the dialog section, which you have to deal with separately.
