We are looking to translate images found in PDF documents from different languages into English.
They are scanned images and often contain tables or other structure. We would like to translate them into English while preserving the structure of the document as much as possible, so a purely text-based translation doesn't suffice.
We saw that the Google Translate app on Android seems to do something similar with photos on a phone. Is there a Google Cloud API that does the same?
To do this on Google Cloud, which API should we use? Can you point us to the API and its documentation?
Thanks.
With Google Cloud products, you can achieve this by using OCR to extract the text and the Translation API to translate it into English.
I suggest using Document AI for OCR, since the API is designed to parse forms and tables. See Document AI Table parsing and Document AI Document parsing for examples of how to use the API. You can then pass the extracted text to the Translation API.
High-level steps:
1. Use Document AI to extract data from the PDF files.
2. Use the Translation API to translate the extracted data into English.
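The second step can keep the table layout if the cells are translated in place. Here is a minimal sketch, assuming the OCR step has already produced a table as a list of rows. The name `translate_batch` is a hypothetical callable standing in for whatever wraps the Translation API, and `fake_translate` is only a stand-in so the example runs without credentials.

```python
from typing import Callable, List

Table = List[List[str]]


def translate_table(
    table: Table,
    translate_batch: Callable[[List[str]], List[str]],
) -> Table:
    """Translate every cell while keeping rows and columns in place."""
    # Flatten the cells so the translator is called once for the whole table.
    flat = [cell for row in table for cell in row]
    translated = translate_batch(flat)
    if len(translated) != len(flat):
        raise ValueError("translator returned a different number of strings")

    # Rebuild the rows using the original row lengths.
    result: Table = []
    pos = 0
    for row in table:
        result.append(translated[pos:pos + len(row)])
        pos += len(row)
    return result


def fake_translate(texts: List[str]) -> List[str]:
    """Stand-in translator (assumption): looks words up in a tiny glossary."""
    glossary = {"Nombre": "Name", "Edad": "Age"}
    return [glossary.get(text, text) for text in texts]


if __name__ == "__main__":
    scanned = [["Nombre", "Edad"], ["Ana", "30"]]
    for row in translate_table(scanned, fake_translate):
        print(row)
```

In a real pipeline, `translate_batch` would send the list of strings to the Translation API and return the translated strings in the same order.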
We develop and maintain a large number of websites which have used the 'old' translate widget for quite some time. Recently, we've undertaken an effort to make all these sites ADA compliant. As it turns out, the widget's implementation is NOT ADA compliant, and it's being deprecated anyway, so our strategy is to move forward and implement the Cloud Translation API.
Many of the site pages are quite large and contain a lot of markup within the body. The body of most sites' home pages is in the vicinity of 20KB. Other site pages are probably somewhat smaller. So, rather than doing a POST to an endpoint on the server, which would in turn post to the API and then have to return the content to the browser, we believe the correct approach is to access the API directly from the browser. If we were to post the HTML content of the body, we would expect the API to return the body with the markup intact and the text translated.
The only example we've been able to find shows code with a non-ajax $.get(...) translating a short text string. We're wondering if there might be other examples out there which more closely address what we're trying to accomplish.
One other side note: removing the markup from one of these 20KB bodies reduces its size to a bit over 5KB, so doing this could result in significant cost savings for our clients. If we were to do this by sending an array of strings to translate as part of the post, is it possible to instruct the API to do a batch translate, which would allow us to replace the original strings with the translated ones?
Right now the only available batch request for translations is batch translation [1]. It requires Cloud Storage: the input files must be stored there, and the translated files are written there. From your explanation, I am not sure this would be of use to you.
I have found this post [2], which describes a workaround that may help if you are able to concatenate what needs to be translated. Basically, you build a single string by concatenating the strings that need to be translated, separated by a delimiter value, and split the result on that delimiter once it has been translated.
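The concatenation workaround can be sketched as follows. This is only an illustration under assumptions: `translate_text` is a hypothetical callable that takes one string and returns its translation, and the newline delimiter is assumed to pass through the translator unchanged, which would need to be verified against the real API.

```python
from typing import Callable, List


def translate_joined(
    strings: List[str],
    translate_text: Callable[[str], str],
    delimiter: str = "\n",
) -> List[str]:
    """Translate many strings with a single call by joining and splitting."""
    if not strings:
        return []
    # The delimiter must not occur inside any input string,
    # otherwise the split after translation would be ambiguous.
    for text in strings:
        if delimiter in text:
            raise ValueError("input string contains the delimiter")

    joined = delimiter.join(strings)
    parts = translate_text(joined).split(delimiter)

    # Guard against the translator dropping or adding delimiters.
    if len(parts) != len(strings):
        raise ValueError("delimiter was not preserved by the translator")
    return parts


if __name__ == "__main__":
    # str.upper stands in for a real translator so the example runs offline.
    print(translate_joined(["hello", "world"], str.upper))
```

The length check matters in practice: if the translator alters the delimiter, the translated pieces can no longer be matched back to the original strings.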
[1] https://cloud.google.com/translate/docs/advanced/batch-translation
[2] Bulk translation of a big set of records via google translate
I am trying to translate English to Hindi via Google's API, but I also need the transliteration of the Hindi string in English (Latin) script.
To illustrate, if I convert
"a quick brown fox...."
to Hindi, it reads
"फुर्तीली भूरी लोमड़ी आलसी कुत्ते के उपर से कूद गई।"
But if you look at the web interface, Google also shows the Hindi version transliterated as
"phurtilee bhoori lomdi ..".
This doesn't show up in the response format of the Translate API. I searched all their docs, but this is all I found: https://cloud.google.com/translate/docs/reference/translate#translatetextresponsetranslation and it has only the translated text in the response.
The Google Translation API does not currently offer phonetic translation, even though it is available in the web interface.
You can file a request for that feature to be included in the API by following the procedure explained in this forum, where the same question was asked.
I was using the Ruby client for Google Cloud Vision to extract vehicle information from automobile original titles.
Observations:
When I used the client API, I got 171 words.
But when I used Google's API demo here: https://cloud.google.com/vision/, I got 459 words, including much of the information I was looking for.
Can anyone please explain how to get the most out of the API?
I found the answer to my question,
thanks to #marlon-giona.
I was referring to the post: Google Vision API text detection strange behaviour - Javascript
When I used image.document to extract dense text, I got exactly the words I was looking for.
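For comparison, the same distinction exists in the Vision REST API, where dense text is requested with the `DOCUMENT_TEXT_DETECTION` feature type rather than `TEXT_DETECTION`. The sketch below only builds the JSON body for an `images:annotate` request and sends nothing; the helper name `build_annotate_request` and the sample bytes are assumptions for illustration.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes, dense: bool = True) -> dict:
    """Build the JSON body for a Vision images:annotate request."""
    # DOCUMENT_TEXT_DETECTION is tuned for dense text such as documents;
    # TEXT_DETECTION targets sparse text in photos.
    feature = "DOCUMENT_TEXT_DETECTION" if dense else "TEXT_DETECTION"
    return {
        "requests": [
            {
                # The REST API expects the image bytes base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes; a real call would read the scanned title image.
    body = build_annotate_request(b"not-a-real-image")
    print(json.dumps(body, indent=2))
```

With document text detection, the recognized text is returned in the `fullTextAnnotation` field of each response.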
I don't have any experience with ElasticSearch yet, but from what I've read I think it suits most of my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into SE and offer a front end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages of one host often contain only a little unique content; a large part of each page is common. The footer, header, menu, etc. are the same on every page.
Does ElasticSearch have some built-in intelligence that can filter out the common parts and search only the real content?
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy to adopt here is to create a unique key per document. Taking a checksum of the content with SHA-1 or a similar algorithm will do the job. Make this checksum the document ID, so that only one copy of a page exists at any point in time. Use the _create API to index if you don't want new duplicates to be indexed (more efficient); if you want the newest copy to become the document, use normal indexing.
If you need to modify the original document when a duplicate is discovered, use an upsert.
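A minimal sketch of the checksum-as-ID idea, using only the standard library. It builds the newline-delimited body for the Elasticsearch bulk API with `create` actions, which are rejected when a document with the same ID already exists. The index name `pages`, the field name `content`, and the whitespace normalization are assumptions for illustration; nothing is sent to a cluster.

```python
import hashlib
import json
from typing import Iterable, List


def content_id(text: str) -> str:
    """Return a SHA-1 hex digest of the page text to use as the document ID."""
    # Collapse whitespace so trivial formatting changes give the same ID
    # (an assumption; adjust the normalization to your crawler's output).
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def bulk_create_lines(index: str, pages: Iterable[str]) -> List[str]:
    """Build bulk API lines: one 'create' action plus one source per page."""
    lines = []
    for text in pages:
        action = {"create": {"_index": index, "_id": content_id(text)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"content": text}))
    return lines


if __name__ == "__main__":
    pages = ["Hello   world", "Hello world", "Another page"]
    # The first two pages normalize to the same text and share one ID,
    # so the second 'create' would be rejected as a duplicate.
    print("\n".join(bulk_create_lines("pages", pages)))
```

Note that this catches only pages whose whole content is identical after normalization; it does not strip shared headers or footers from otherwise distinct pages.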
I have explained a great deal of this in this blog.
I'm curious to know how Market Samurai, Long Tail Pro and other software retrieve the top 10 Google search results without running into limits. It appears that these software packages use the user's own Google account. Google Custom Search limits users to 100 queries per day (the free limit), but people tend to do keyword research on hundreds or even thousands of keywords per day and don't pay any additional amounts to Google.
Are they paying extra for this service, are they using a different API (perhaps the Adwords API?) or are they scraping the Google search results page (violation of TOS)? Really would like to know! Thanks.
I have done this in one of my projects (in Java).
It is very simple: in Java there is a library called Jsoup, which you can use to send a GET request to Google, for example:
https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=<your url encoded search term>
This returns the HTML of the Google search results page for your search term.
Using Jsoup, you can find specific HTML tags by class or id. This lets you extract the URL, title and description from each Google search result.
For a working example, check here; it extracts the Google search result links for a custom search term.
I hope this helps.
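The tag-and-class lookup described above can be illustrated in Python with the standard library's `html.parser`. This sketch parses a static sample string rather than a live results page, and the class name `result` is hypothetical; the real markup would have to be inspected.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags that carry a given class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.css_class in classes and attr.get("href"):
            # Start capturing the text of a matching link.
            self._href = attr["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


SAMPLE_HTML = (
    '<div><a class="result" href="https://example.com/a">First <b>hit</b></a>'
    '<a href="/skip">nav</a>'
    '<a class="result big" href="https://example.com/b">Second</a></div>'
)

if __name__ == "__main__":
    parser = LinkExtractor("result")
    parser.feed(SAMPLE_HTML)
    print(parser.links)
```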